Abstract

In this work, we consider a retailer selling a single product with limited on-hand inventory
over a finite selling season. Customer demand arrives according to a Poisson process, the rate
of which is influenced by a single action taken by the retailer (such as price adjustment, sales 
commission, advertisement intensity, etc.). The relation between the action and the demand 
rate is not known in advance. The retailer will learn the optimal action policy "on the fly" as 
she maximizes her total expected revenue based on observed demand reactions. 

Using the pricing problem as an example, we propose a dynamic "learning-while-doing" 
algorithm to achieve a near optimal performance. Furthermore, we prove that the convergence 
rate of our algorithm is almost the fastest among all possible algorithms in terms of asymptotic
"regret" (the relative loss compared to the full-information optimal solution). Our result closes
the performance gaps between parametric and non-parametric learning and between a post-price
mechanism and a customer-bidding mechanism. Important managerial insights from this
research are that the value of information on 1) the parametric form of the demand function and
2) each customer's exact reservation price is rather marginal. It also suggests that firms would
be better off performing dynamic learning and doing concurrently, instead of learning first and
doing second, in practice.

1 Introduction 

Revenue management is one of the central problems for many industries such as airlines, hotels, 
and retailers who sell fashion goods. In revenue management problems, the availability of 
products is often limited in quantity and/or time, and the customer demand behavior is either 
unknown or uncertain. However, demands can be influenced by actions such as price adjustment, 
advertisement intensity, sales person compensation, etc. Thus, retailers are interested in finding
an optimal action policy to maximize their revenue in such an environment. 

Most existing research in revenue management assumes that the functional relationship be- 
tween demand distribution (or the instantaneous demand rate) and retailers' actions is known 
to the decision makers. This relationship is then exploited to derive optimal policies. However, 
in reality, decision makers seldom possess such information. This is especially true when a new 
product/service is provided at a new location or the market environment has changed. In light 
of this, some recent research has proposed learning methods that allow decision makers to learn 
the demand functions while optimizing their policies based on up-to-date information.

There are two types of learning models: parametric and non-parametric. In parametric
approaches, people assume that prior information has been obtained about which parametric 
family the demand function belongs to. Decision makers then take actions "on the fly" while 
updating their beliefs about the underlying parameters. On the other hand, in non-parametric 
approaches, people do not assume any structural properties of the demand function except 
some basic regularity conditions. And it is the decision maker's task to learn the demand 
curve with very limited information. Intuitively, the non-parametric approaches are harder 
than the parametric counterparts since the non-parametric function space is much larger than 
the parametric one. However, the exact difference between these two models is not clear and 
several questions are to be studied: First, what are the "optimal" learning strategies for each 
setting? Second, what are the minimal revenue losses that have to be incurred over all possible 
strategies? Third, how valuable is the information that the demand function belongs to a 
parametric family? Besides, it seems quite advantageous for the retailer to be able to obtain 
each customer's exact valuation rather than only observing a "yes-or-no" purchase decision. But 
how much value is added? 

In this paper, we attempt to provide a complete answer to these questions using a pricing 
model as example where the retailer's action is to control the price. The reason we choose the 
pricing problem is two-fold: First, the pricing problem is well-studied in the literature so that 
our results can be directly positioned and compared. Second, in the pricing problem, there are
two mechanisms: the customer-bidding mechanism, where the valuation of each customer is fully
revealed, and the post-price mechanism, where only binary information about each customer's purchase
decision is observed. Intuitively, the customer-bidding mechanism would be more efficient (all
other conditions being the same) since it extracts more information from each customer. However,
our results imply that the two mechanisms have the same level of asymptotic
efficiency.

In the pricing problem, a retailer is facing a given initial inventory and a finite selling season. 
The demand is formulated as a Poisson process whose intensity at each time is controlled by the 
prevailing price posted by the decision maker. We are interested in the case where the demand 
function is not known to the decision maker and the information regarding the demand function 
can only be obtained through observing realized demand. Specifically, we focus ourselves on 
the non-parametric setting where only some regularity conditions are assumed on the demand 
functions. We propose a dynamic price learning algorithm for this case and show that our 
algorithm is near "optimal" in the sense that no pricing policy can achieve a much better 
performance (which we will precisely define later) than the one generated by our algorithm. 
Consequently, the optimal learning strategies for the parametric and non-parametric cases are
the same, and the minimal revenue loss is of the same order.

As discussed in much of the literature, the key to a good pricing algorithm under demand uncertainty
lies in its ability to balance the tension between demand learning (exploration) and
near-optimal pricing (exploitation). The more time one spends in price exploration, the less 
time remains to exploit the knowledge to obtain the optimal revenue. On the other hand, if not 
enough price exploration is performed, then one may not be able to find a price good enough 
to achieve satisfactory revenue. This is especially true in the non-parametric setting where it 
is harder to infer structural information from the observed demand. Previously, researchers
proposed price learning algorithms with separate learning and doing phases, where a grid of
prices is tested and then the empirically best one is used for pricing. Theoretical results
show that those algorithms achieve asymptotic optimality at a decent rate; see Besbes and
Zeevi [8].

One of our main contributions in this paper is to propose a dynamic price learning algorithm 
that iteratively performs price experimentation within a shrinking series of intervals that always 
contain the optimal price (with high probability). We show that this dynamic price learning 
algorithm achieves better asymptotic revenue (in terms of the regret, which is the relative loss compared to
the case when the demand function is known exactly, as the problem size grows large) than the
grid learning strategy. By exhibiting a worst-case example for all possible policies, we prove that
our algorithm provides nearly the best asymptotic performance over all possible pricing policies.
To the best of our knowledge, this is the first time such an algorithm has been proposed and analyzed. This
result suggests that we should not separate price experimentation and exploitation; instead, we
should combine "learning" and "doing" in a concurrent procedure, which might be of interest
to revenue management practitioners.

In more detail, we summarize the contributions of this work as follows:

1. Under some mild regularity conditions on the demand function, we show that our pricing
policy achieves a regret arbitrarily close to $O(n^{-1/2})$ (in terms of the order of $n$), uniformly
across all possible demand functions. This result improves the best-known bounds (on the
asymptotic regret) by Besbes and Zeevi [8] for both non-parametric learning (the best
known bound was $O(n^{-1/4})$) and parametric learning (the best known bound was
$O(n^{-1/3})$, except for the case with only one parameter) in this context. Thus, it closes the
efficiency gap between parametric and non-parametric learning in this setting. It implies
that there is no additional cost associated with performing non-parametric learning, in
terms of asymptotic regret. Therefore, our study suggests that firms could save the time and
effort spent on checking which class of parametric functions the demand belongs to and on collecting
data for curve fitting, which is widely done in practice.

2. Our result also closes the gap between two revenue management mechanisms: the customer-bidding
mechanism and the post-price mechanism. In Agrawal et al., the authors obtained
a dynamic learning algorithm with $O(n^{-1/2})$ regret under the former mechanism
(in a slightly different setting). However, under the post-price model, the previous best
algorithm, by Besbes and Zeevi [8], achieves a regret of $O(n^{-1/4})$. Our result asserts that
although the post-price mechanism extracts much less information about each individual
customer's valuation of the product, it can achieve the same order of asymptotic behavior as
the customer-bidding mechanism. Therefore, our result supports the use of the
post-price mechanism, which is more widespread in practice.

3. On the methodology side, our algorithm provides a new form of dynamic learning method.
In particular, we do not separate the "learning" and "doing" phases; instead, we integrate
"learning" and "doing" by considering a shrinking series of price intervals. This
concurrent dynamic is the key to achieving the right balance between price exploration
and exploitation, and thus to achieving near-maximum efficiency in pricing. We
believe that this method may be applied to problems with even more complex structure.

The rest of the paper is organized as follows. In Section 2, we review related literature in
this field. In Section 3, we introduce our model and state our main assumptions. In Section 4,
we present our dynamic price learning algorithm and the main theorems. We provide a proof
sketch for our algorithm in Section 5. In Section 6, we show a lower bound on the regret for all
possible pricing policies. We show some numerical results in Section 7 and some extensions of
this model in Section 8. We conclude the paper in Section 9. An appendix follows with
the detailed proofs of our technical results.

2 Literature Review 

Pricing mechanisms have been an important research area in revenue management and there
is abundant literature on this topic. For a comprehensive review of this subject, we refer
the readers to Bitran and Caldentey [10], Elmaghraby and Keskinocak, and Talluri and
van Ryzin [24]. Previously, research has mostly focused on the cases where the functional
relationship between the price and demand (also known as the demand function) is given to the
decision maker. Gallego and van Ryzin [16] present a foundational work in this setting, where
the structure of the demand function is exploited and the optimal pricing policies are analyzed.

Although knowing the exact demand function is convenient for analysis, decision makers
in practice do not usually have such information. Therefore, much recent literature addresses
dynamic pricing problems under demand uncertainty. The majority of these works take
the parametric approach, e.g., Lobo and Boyd [21], Bertsimas and Perakis [7], Araman and
Caldentey [3], Aviv and Pazgal [5], Carvalho and Puterman [13], Farias and Van Roy,
Broder and Rusmevichientong, and Harrison et al. Typically in these works, prior
knowledge of the parameters is assumed and a dynamic program with Bayesian updating of
the parameters is formulated. Although such an approach simplifies the problem to some degree,
the restriction to a certain demand function family may incur model misspecification risk. As
shown in Besbes and Zeevi [8], misspecifying the demand family may lead to revenue far
from the optimal. In such a case, a non-parametric approach would be preferred since it does not
commit to any family of demand functions.

The main difficulty facing the non-parametric approach is its tractability and efficiency, and
most research revolves around this issue. Several studies consider a model in which the customers
are chosen adversarially, e.g., Ball and Queyranne, and Perakis and Roels [22]. However, their
models take a relatively conservative approach and no learning is involved. Rusmevichientong
et al. [23] consider static learning using historical data with no dynamic decisions being made. In
another paper, Lim and Shanthikumar [20] consider dynamic pricing strategies that are
robust to an adversary at every point in the stochastic process. Again, this approach is quite
conservative and the main theme is robustness rather than demand learning.

The work that is closest to this paper is that of Besbes and Zeevi [8], where the authors considered
demand learning in both the parametric and non-parametric cases. They proposed learning
algorithms for both cases and showed that there is a gap in performance between them. They
also provided a lower bound for the revenue loss in both cases. In this paper, we continue their
work by improving the bound for both cases and closing the gap between them. In particular,
they considered algorithms with separate learning and doing phases, where price experimentation
is performed exclusively during the learning phase (except in the parametric case with a single
parameter). In our paper, learning and doing are dynamically integrated: we keep shrinking
a price interval that contains the optimal price and keep learning until we guarantee that the
revenue achieved by applying the current price is near-optimal. Although our setting resembles
theirs, our algorithm is quite different and the results are stronger.

Another paper that is related to ours is Kleinberg and Leighton, who considered
the online post-price auction with unlimited supply. They showed lower bounds for the revenue
loss in three cases: 1) each customer has the same (deterministic) but unknown valuation, 2)
each customer has an i.i.d. unknown valuation, and 3) the valuations of customers are chosen
adversarially to the algorithm. They also provided algorithms that match the lower
bound in each of these cases. Specifically, for the case where each customer has an i.i.d. valuation,
they presented an algorithm with the same order of regret as ours. However, their model is
different from ours in several ways. First, they considered an unconstrained revenue maximization
problem (without an inventory constraint) while we consider a constrained problem (with an
inventory constraint). Second, they considered a discrete-time arrival model while we consider
a continuous-time Poisson arrival model. As we will see in our algorithm, these differences are
nontrivial and our analysis is fundamentally different from theirs.

Other related literature that focuses on the exploration-exploitation trade-off in sequential
optimization under uncertainty comes from the study of the multi-armed bandit problem: see Lai
and Robbins [12], Agrawal [1] and Auer et al. [3] and references therein. Although our study
shares ideas with the multi-armed bandit problem, we consider a problem with a
continuous learning horizon (time is continuous), a continuous learning space (the set of possible
demand functions is continuous) and a continuous action space (the price set is continuous). These
features, in addition to the presence of an inventory constraint, make our algorithm and analysis
quite different from theirs.



3 Problem Formulation 

3.1 Basic Model and Assumptions 

In this paper, we consider the problem of a monopolist selling a single product in a finite selling
season $T$. The seller has a fixed inventory $x$ at the beginning and no recourse actions on the
inventory can be made during the selling season. During the selling season, demand for this
product arrives according to a Poisson process with intensity $\lambda_t$, where $\lambda_t$ is the
instantaneous demand rate at time $t$. In our model, we assume that $\lambda_t$ is solely determined by
the price offered at time $t$; that is, we can write $\lambda_t = \lambda(p(t))$ as a function of $p(t)$. At time $T$,
sales are terminated and there is no salvage cost for the remaining items (as shown in
[16], the zero salvage cost assumption is without loss of generality).
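
To make the demand model concrete, the following is a minimal simulation sketch of selling at one fixed price. It is an illustration only: the function names, the linear demand curve and all numerical values are assumptions introduced here, not taken from the paper. Arrivals are generated by summing exponential inter-arrival times, and sales stop once the stock is exhausted, which plays the role of switching to the cut-off price.

```python
import random

def simulate_fixed_price(price, demand_rate, inventory, horizon, rng):
    """Simulate Poisson arrivals at rate demand_rate(price) on [0, horizon].

    Returns (units_sold, revenue). Sales stop when the stock runs out,
    which mimics switching to the cut-off price.
    """
    rate = demand_rate(price)
    t, sold = 0.0, 0
    while sold < inventory and rate > 0:
        t += rng.expovariate(rate)  # exponential inter-arrival time
        if t > horizon:
            break
        sold += 1
    return sold, price * sold

def linear_demand(p):
    # Hypothetical demand curve, used only for illustration.
    return max(0.0, 10.0 - 2.0 * p)

rng = random.Random(0)
sold, revenue = simulate_fixed_price(3.0, linear_demand, inventory=5, horizon=1.0, rng=rng)
```

Whatever the random draw, the number of units sold can never exceed the initial inventory, which is the constraint every admissible policy has to respect.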

We assume the feasible set of prices is an interval $[\underline{p}, \overline{p}]$ with an additional cut-off price $p_\infty$
such that $\lambda(p_\infty) = 0$. Regarding the demand rate function $\lambda(p)$, we assume it is decreasing in
$p$, has an inverse function $p = \gamma(\lambda)$, and the revenue rate function $r(\lambda) = \lambda\gamma(\lambda)$ is concave in $\lambda$.
These assumptions are quite standard, and such demand functions are called "regular"
demand functions, as defined in [16].

Besides being regular, we also make the following assumptions on the demand rate function
$\lambda(p)$ and the revenue rate function $r(\lambda)$:



Assumption A. For some positive constants $M$, $K$, $m_L$ and $m_U$,

1. Boundedness: $|\lambda(p)| \le M$ for all $p \in [\underline{p}, \overline{p}]$;

2. Lipschitz continuity: $\lambda(p)$ and $r(\lambda(p))$ are Lipschitz continuous with respect to $p$ with
factor $K$. Also, the inverse demand function $p = \gamma(\lambda)$ is Lipschitz continuous in $\lambda$ with
factor $K$;

3. Strict concavity and differentiability: $r''(\lambda)$ exists and $-m_L \le r''(\lambda) \le -m_U < 0$ for all $\lambda$
in the range of $\lambda(p)$ for $p \in [\underline{p}, \overline{p}]$.

In the following, let $\Gamma = \Gamma(M, K, m_L, m_U)$ denote the set of demand functions satisfying the
above assumptions with the corresponding coefficients. We briefly explain these assumptions
as follows. The first assumption is just an upper bound on the demand rate. The second
assumption says that when we change the price by a small amount, the demand and revenue rates
will not change by too much; also, the demand function does not have a "flat" region. These two
assumptions are quite standard, as they appear in most of the literature on revenue management with
demand learning, e.g., in [16], [5] and [12]. The last assumption contains two parts, one being
the smoothness of the revenue function, the other being strict concavity. The assumption on the
existence of second derivatives is also made in [12] and elsewhere, and the strict concavity assumption
also appears in earlier work. As shown in Appendix 10.1, our assumptions hold for several classes of
commonly used demand functions (e.g., linear, exponential, and logit demand functions).

In our model, we assume that the seller does not know the true demand function $\lambda$; the
only knowledge she has about $\lambda$ is that it belongs to $\Gamma$. Note that $\Gamma$ does not need to have any
parametric representation. Therefore, our model is robust in terms of the choice of the demand
function family.
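
As a small numerical illustration of the curvature condition in Assumption A, the sketch below takes a hypothetical linear demand curve, lambda(p) = a - b*p with a = 10 and b = 2 (values chosen here for illustration, not from the paper), and estimates the second derivative of the revenue rate r(lambda) = lambda * gamma(lambda) by central finite differences. For this curve r is quadratic, so one would expect the estimate to be close to -2/b at every point.

```python
def gamma(lam, a=10.0, b=2.0):
    """Inverse demand p = gamma(lambda) for the linear curve lambda(p) = a - b*p."""
    return (a - lam) / b

def revenue_rate(lam, a=10.0, b=2.0):
    """Revenue rate r(lambda) = lambda * gamma(lambda)."""
    return lam * gamma(lam, a, b)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# With prices in [1, 4], the demand rate lambda ranges over [2, 8].
curvatures = [second_derivative(revenue_rate, lam) for lam in (2.0, 4.0, 6.0, 8.0)]
```

For this particular curve the curvature is constant, so the two curvature bounds in the strict concavity condition can be taken equal; other demand families would generally give a genuine range.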



3.2 Performance Analysis 

To evaluate the performance of any pricing algorithm, we adopt the minimal regret objective
formalized in [5]. Consider a pricing policy $\pi$. At each time $t$, $\pi$ maps the entire history of prices
and realized demand into a current price $p(t)$. By our assumption that the demand
follows a Poisson process, the cumulative demand up to time $t$ can be written as follows:

$$N^\pi(t) = N\left(\int_0^t \lambda(p(s))\,ds\right) \tag{1}$$

where $N(\cdot)$ is a unit-rate Poisson process. In order to satisfy the inventory constraint, any
admissible policy $\pi$ must satisfy:

$$\int_0^T dN^\pi(s) \le x, \qquad p(s) \in [\underline{p}, \overline{p}] \cup \{p_\infty\}, \quad 0 \le s \le T. \tag{2}$$

We denote the set of policies satisfying (2) by $\mathcal{P}$. Note that the seller can always set the price to
$p_\infty$, thus constraint (2) can always be met. The expected revenue generated by a policy $\pi$ is as
follows:

$$J^\pi(x, T; \lambda) = E\left[\int_0^T p(s)\,dN^\pi(s)\right] \tag{3}$$



Here, the presence of $\lambda$ in (3) means that the expectation is taken under the demand function $\lambda$.
Given a demand function $\lambda$, we wish to find the optimal policy $\pi^*$ that maximizes the expected
revenue (3) subject to the inventory constraint (2). In our model, since we do not have
perfect information on $\lambda$, we seek a pricing policy $\pi$ that performs as close to $\pi^*$ as possible.

However, even if the demand function $\lambda$ is known, computing the expected value of the
optimal policy is hard. It involves solving a Bellman equation resulting from a dynamic program.
Fortunately, as shown in previous work [8], [16], we can obtain an upper bound on
the expected value of any policy by considering a full-information deterministic optimization
problem. Define:

$$\begin{aligned} J^D(x, T; \lambda) = \sup \quad & \int_0^T r(\lambda(p(s)))\,ds \\ \text{s.t.} \quad & \int_0^T \lambda(p(s))\,ds \le x, \\ & p(s) \in [\underline{p}, \overline{p}] \cup \{p_\infty\} \quad \forall s \in [0, T]. \end{aligned} \tag{4}$$

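In (4), all the stochastic processes are substituted by their means. In [8], the authors showed
that $J^D(x, T; \lambda)$ provides an upper bound on the expected revenue generated by any admissible
pricing policy $\pi$, that is, $J^\pi(x, T; \lambda) \le J^D(x, T; \lambda)$ for all $\lambda \in \Gamma$ and $\pi \in \mathcal{P}$. With this useful
relaxation, we can define the regret $R^\pi(x, T; \lambda)$ for any given demand function $\lambda \in \Gamma$ and policy
$\pi \in \mathcal{P}$ to be

$$R^\pi(x, T; \lambda) = 1 - \frac{J^\pi(x, T; \lambda)}{J^D(x, T; \lambda)}. \tag{5}$$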

As we mentioned above, the deterministic optimal solution $J^D(x, T; \lambda)$ provides an upper bound
on the expected value of any policy $\pi$; therefore $R^\pi(x, T; \lambda)$ is always non-negative. By
definition, the smaller the regret, the closer $\pi$ is to the optimal policy. However, since the decision
maker does not know the true demand function, it is desirable to obtain a pricing policy $\pi$ that
achieves small regret across all underlying demand functions $\lambda \in \Gamma$. In particular, we
consider the "worst-case" regret, where the decision maker chooses a pricing policy $\pi$, and
nature picks the worst possible demand function for that policy:

$$\sup_{\lambda \in \Gamma} R^\pi(x, T; \lambda). \tag{6}$$

Obviously, the seller wants to minimize the worst-case regret, i.e., we are interested in solving:

$$\inf_{\pi \in \mathcal{P}} \sup_{\lambda \in \Gamma} R^\pi(x, T; \lambda). \tag{7}$$

Now our objective is clear. However, it is hard to evaluate (7) for any single problem. Therefore,
in this work, we adopt the widely used asymptotic performance analysis (the regime of high
sales volume). We consider a regime in which both the size of the initial inventory and
the potential demand grow proportionally large. In particular, for a market of size $n$, where $n$
is a positive integer, the initial inventory and the demand function are given by

$$x_n = nx \quad \text{and} \quad \lambda_n(\cdot) = n\lambda(\cdot). \tag{8}$$

Denote by $J_n^D(x, T; \lambda)$ the deterministic optimal solution for the problem of size $n$; it is
easy to see that $J_n^D = nJ^D$. We also define $J_n^\pi(x, T; \lambda)$ to be the expected value of a pricing
algorithm $\pi$ when it is applied to a problem of size $n$. Therefore, we can define the regret for
the size-$n$ problem $R_n^\pi(x, T; \lambda)$ as

$$R_n^\pi(x, T; \lambda) = 1 - \frac{J_n^\pi(x, T; \lambda)}{J_n^D(x, T; \lambda)}, \tag{9}$$

and our objective is to study the asymptotic behavior of $R_n^\pi(x, T; \lambda)$ as $n$ grows large and design
an algorithm with small asymptotic regret.
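
The scaling regime can be illustrated with a small exact computation, sketched below under assumptions that are not from the paper: a known demand rate of 4 per unit time at a fixed price of 3, inventory x = 4, horizon T = 1, and a deterministic benchmark of 12 for the base problem. This is a full-information fixed-price policy, not the learning algorithm. Total demand in the size-n problem is Poisson with mean 4n, and sales are capped by the inventory 4n, so the regret of this policy can be computed directly from the Poisson distribution.

```python
import math

def expected_min_poisson(mean, cap):
    """E[min(N, cap)] for N ~ Poisson(mean), with cap a non-negative integer."""
    total, cdf = 0.0, 0.0
    for k in range(cap):
        # Poisson pmf evaluated in log space to avoid overflow for large k.
        pmf = math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
        total += k * pmf
        cdf += pmf
    return total + cap * max(0.0, 1.0 - cdf)

def fixed_price_regret(n, price=3.0, rate=4.0, x=4.0, horizon=1.0, j_det=12.0):
    """Relative revenue loss of one fixed price in the size-n problem.

    All default values are illustrative assumptions; j_det is the
    deterministic benchmark of the base (n = 1) problem.
    """
    sales = expected_min_poisson(n * rate * horizon, int(n * x))
    return 1.0 - price * sales / (n * j_det)

regrets = {n: fixed_price_regret(n) for n in (25, 100, 400)}
```

In this example the regret shrinks as n grows, and it should roughly halve each time n is quadrupled, which is the square-root behavior that the asymptotic analysis is concerned with.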

4 Main Results: A Dynamic Pricing Algorithm 

In this section, we introduce our main results: an optimal dynamic pricing algorithm. Before 
we state our theorems, it is useful to introduce some basic structural intuition of this problem. 

4.1 Structural Insights 

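Consider the full-information deterministic problem (4). As shown in previous work, the optimal solution
to (4) is given by

$$p^D = \max\{p^u, p^c\} \tag{10}$$

where

$$p^u = \arg\max_{p \in [\underline{p}, \overline{p}]} \{r(\lambda(p))\}, \tag{11}$$

$$p^c = \arg\min_{p \in [\underline{p}, \overline{p}]} \left|\lambda(p) - \frac{x}{T}\right|. \tag{12}$$

Here, the superscript $u$ stands for "unconstrained" and the superscript $c$ stands for "constrained".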



As shown in (10), the optimal price is either the revenue-maximizing price or the inventory-depleting
price, whichever is larger. It is shown in [16] that if one knows $p^D$, then the revenue
collected by using a fixed price $p^D$ will be close to the deterministic optimal solution. Therefore,
the goal of our algorithm is to learn an estimate of $p^D$ close enough to the true one, using the empirical
observations at hand. We make one technical assumption about the value of $p^D$ as follows.

Assumption B. There exists $\epsilon > 0$ such that $p^D \in [\underline{p} + \epsilon, \overline{p} - \epsilon]$ for all $\lambda \in \Gamma$.

Assumption B says that we require the optimal deterministic price to be in the interior of
the price interval. This assumption is mainly for the purpose of analysis and is not restrictive,
since one can always choose a large interval $[\underline{p}, \overline{p}]$ to start with.
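
When the demand function is known, the prices in (10)-(12) can be approximated by a simple grid search. The sketch below assumes a hypothetical linear demand curve, a price interval [1, 4] and x/T = 4; none of these values come from the paper, and the grid search is an illustrative shortcut rather than the method used in the analysis.

```python
def deterministic_price(demand_rate, p_low, p_high, x, horizon, grid=10001):
    """Grid approximation of p^u, p^c and p^D = max(p^u, p^c)."""
    prices = [p_low + (p_high - p_low) * k / (grid - 1) for k in range(grid)]
    # Revenue-maximizing price: maximize p * lambda(p) over the grid.
    p_u = max(prices, key=lambda p: p * demand_rate(p))
    # Inventory-depleting price: demand rate closest to x / T.
    p_c = min(prices, key=lambda p: abs(demand_rate(p) - x / horizon))
    return p_u, p_c, max(p_u, p_c)

def linear_demand(p):
    # Hypothetical demand curve, used only for illustration.
    return 10.0 - 2.0 * p

p_u, p_c, p_d = deterministic_price(linear_demand, 1.0, 4.0, x=4.0, horizon=1.0)
```

In this example the revenue-maximizing price would sell more than the available stock, so the inventory-depleting price is the larger of the two and determines the deterministic optimum.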

4.2 An Optimal Dynamic Pricing Algorithm 

We first state our main result as follows:

Theorem 1. Let Assumptions A and B hold for $\Gamma = \Gamma(M, K, m_L, m_U)$ and a fixed $\epsilon$. Then for
any $\delta < 1/2$, there exists a policy $\pi_\delta \in \mathcal{P}$ generated by Algorithm DPA, such that for all $n \ge 1$,

$$\sup_{\lambda \in \Gamma} R_n^{\pi_\delta}(x, T; \lambda) \le \frac{C}{n^\delta} \tag{13}$$

for some constant $C$.



Here $C$ only depends on $M$, $K$, $m_L$, $m_U$, $\epsilon$, the initial inventory $x$ and the length of the time
horizon $T$. However, the exact dependence is quite complex and thus omitted. A corollary of
Theorem 1 follows from the relationship between the non-parametric model and the parametric
one:

Corollary 1. Assume $\Gamma$ is a parameterized demand function family satisfying Assumptions A
and B for some coefficients. Then for any $\delta < 1/2$, there exists a policy $\pi_\delta \in \mathcal{P}$ generated by
Algorithm DPA, such that for all $n \ge 1$,

$$\sup_{\lambda \in \Gamma} R_n^{\pi_\delta}(x, T; \lambda) \le \frac{C}{n^\delta} \tag{14}$$

for some constant $C$.

We also establish a lower bound of asymptotic regret for any possible pricing policies: 

Theorem 2. There exists a set of demand functions $\Gamma$ parameterized by a single parameter
satisfying Assumptions A and B with certain coefficients, such that for any admissible pricing
policy $\pi$, for all $n \ge 1$,

$$\sup_{\lambda \in \Gamma} R_n^\pi(x, T; \lambda) \ge \frac{C}{\sqrt{n}} \tag{15}$$

for some constant $C$ that only depends on the coefficients in $\Gamma$, $x$ and $T$.

Theorems 1 and 2 together provide a clear answer to the magnitude of regret the best pricing
policy can achieve, under both the parametric and non-parametric settings.

Now we describe our algorithm. As we mentioned in Subsection 4.1, we aim to learn $p^D$
through iterative price experimentation. Specifically, our algorithm will be able to distinguish
whether $p^u$ or $p^c$ is optimal. Meanwhile, we keep shrinking an interval containing the optimal
price until a certain accuracy is achieved.

Now we present our dynamic pricing algorithm. Explanations and an illustration of the algorithm
will follow right after.



Algorithm DPA (Dynamic Pricing Algorithm) : 

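Step 1. Initialization:

(a) Consider a sequence of $\tau_i^u$, $\kappa_i^u$, $i = 1, 2, \ldots, N^u$ and $\tau_i^c$, $\kappa_i^c$, $i = 1, 2, \ldots, N^c$ ($\tau$ and $\kappa$ represent
the length of each learning period and the number of different prices considered in each
learning period, respectively. Their values, along with the values of $N^u$ and $N^c$, will be
defined later). Define $\underline{p}_1^u = \underline{p}_1^c = \underline{p}$ and $\overline{p}_1^u = \overline{p}_1^c = \overline{p}$. Define $t_i^u = \sum_{j=1}^{i} \tau_j^u$ for $i = 0$ to $N^u$
and $t_i^c = \sum_{j=1}^{i} \tau_j^c$ for $i = 0$ to $N^c$;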

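Step 2. Learn $p^u$ or determine $p^c > p^u$:

For $i = 1$ to $N^u$ do

(a) Divide $[\underline{p}_i^u, \overline{p}_i^u]$ into $\kappa_i^u$ equally spaced intervals and let $\{p_{i,j}^u, j = 1, 2, \ldots, \kappa_i^u\}$ be the left
endpoints of these intervals

(b) Divide the time interval $[t_{i-1}^u, t_i^u]$ into $\kappa_i^u$ equal parts and define

$$\Delta_i^u = \frac{\tau_i^u}{\kappa_i^u}, \qquad t_{i,j}^u = t_{i-1}^u + j\Delta_i^u, \quad j = 0, 1, \ldots, \kappa_i^u$$

(c) Apply $p_{i,j}^u$ from time $t_{i,j-1}^u$ to $t_{i,j}^u$, as long as the inventory is still positive. If no more
units are in stock, apply $p_\infty$ until time $T$ and STOP

(d) Compute

$$\hat{d}(p_{i,j}^u) = \frac{\text{total demand over } [t_{i,j-1}^u, t_{i,j}^u)}{\Delta_i^u}, \quad j = 1, \ldots, \kappa_i^u$$

(e) Compute

$$\hat{p}_i^u = \arg\max_{1 \le j \le \kappa_i^u} \{p_{i,j}^u\, \hat{d}(p_{i,j}^u)\}$$

and

$$\hat{p}_i^c = \arg\min_{1 \le j \le \kappa_i^u} \left|\hat{d}(p_{i,j}^u) - x/T\right|$$

(f) If

$$\hat{p}_i^c > \hat{p}_i^u + 2\sqrt{\log n} \cdot \frac{\overline{p}_i^u - \underline{p}_i^u}{\kappa_i^u} \tag{16}$$

then break from Step 2, enter Step 3 and denote this $i$ by $i_0$;
Else, set $\hat{p}_i = \max\{\hat{p}_i^c, \hat{p}_i^u\}$. Define

$$\underline{p}_{i+1}^u = \hat{p}_i - \frac{\log n}{3} \cdot \frac{\overline{p}_i^u - \underline{p}_i^u}{\kappa_i^u} \tag{17}$$

and

$$\overline{p}_{i+1}^u = \hat{p}_i + \frac{2\log n}{3} \cdot \frac{\overline{p}_i^u - \underline{p}_i^u}{\kappa_i^u} \tag{18}$$

And define the price range for the next iteration

$$I_{i+1}^u = [\underline{p}_{i+1}^u, \overline{p}_{i+1}^u]$$

Here we truncate the interval if it does not lie inside the feasible set $[\underline{p}, \overline{p}]$
(g) If $i = N^u$, then enter Step 4(a);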

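Step 3. Learn $p^c$ when $p^c > p^u$:

For $i = 1$ to $N^c$ do

(a) Divide $[\underline{p}_i^c, \overline{p}_i^c]$ into $\kappa_i^c$ equally spaced intervals and let $\{p_{i,j}^c, j = 1, 2, \ldots, \kappa_i^c\}$ be the left
endpoints of these intervals

(b) Define

$$\Delta_i^c = \frac{\tau_i^c}{\kappa_i^c}, \qquad t_{i,j}^c = t_{i-1}^c + j\Delta_i^c + t_{i_0}^u, \quad j = 0, 1, \ldots, \kappa_i^c$$

(c) Apply $p_{i,j}^c$ from time $t_{i,j-1}^c$ to $t_{i,j}^c$, as long as the inventory is still positive. If no more
units are in stock, apply $p_\infty$ until time $T$ and STOP

(d) Compute

$$\hat{d}(p_{i,j}^c) = \frac{\text{total demand over } [t_{i,j-1}^c, t_{i,j}^c)}{\Delta_i^c}, \quad j = 1, \ldots, \kappa_i^c$$

(e) Compute

$$\hat{q}_i = \arg\min_{1 \le j \le \kappa_i^c} \left|\hat{d}(p_{i,j}^c) - x/T\right|$$

Define

$$\underline{p}_{i+1}^c = \hat{q}_i - \frac{\log n}{2} \cdot \frac{\overline{p}_i^c - \underline{p}_i^c}{\kappa_i^c} \tag{19}$$

and

$$\overline{p}_{i+1}^c = \hat{q}_i + \frac{\log n}{2} \cdot \frac{\overline{p}_i^c - \underline{p}_i^c}{\kappa_i^c} \tag{20}$$

And define the price range for the next iteration

$$I_{i+1}^c = [\underline{p}_{i+1}^c, \overline{p}_{i+1}^c]$$

Here we truncate the interval if it does not lie inside the feasible set $[\underline{p}, \overline{p}]$
(f) If $i = N^c$, then enter Step 4(b);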

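Step 4. Apply the learned price:

(a) Define $\tilde{p} = \hat{p}_{N^u} + 2\sqrt{\log n} \cdot \frac{\overline{p}_{N^u}^u - \underline{p}_{N^u}^u}{\kappa_{N^u}^u}$. Use $\tilde{p}$ for the rest of the selling season until the stock
runs out;

(b) Define $\tilde{q} = \hat{q}_{N^c}$. Use $\tilde{q}$ for the rest of the selling season until the stock runs out.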



Now we explain this algorithm before we proceed to proofs. The idea of this algorithm is to 
divide the time interval into pieces, and in each piece, we test a grid of prices on a price interval. 
We find the empirical optimal price, then shrink the price interval to a smaller subinterval that 
still contains the optimal price (with high probability), and enter the next time interval with the 
smaller price range. We repeat the shrinking procedure until the price interval is small enough 
so that the desired accuracy is achieved. 
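
The test-then-shrink idea described above can be sketched as a single iteration in code. This is a simplified illustration under stated assumptions, not the algorithm as analyzed: `estimate_demand` is a hypothetical callback standing in for the empirical demand rate observed while a price is posted, the shrink fractions `left_frac` and `right_frac` and the feasible range are illustrative parameters, and the test that triggers the switch from Step 2 to Step 3 is omitted.

```python
import math

def learning_iteration(p_low, p_high, kappa, estimate_demand, x_over_t, n,
                       left_frac=1.0 / 3.0, right_frac=2.0 / 3.0,
                       feasible=(1.0, 4.0)):
    """Test a price grid once and return the estimate and the next interval."""
    width = (p_high - p_low) / kappa
    grid = [p_low + j * width for j in range(kappa)]  # left endpoints
    d_hat = [estimate_demand(p) for p in grid]        # empirical demand rates
    # Empirical revenue-maximizing and inventory-depleting grid prices.
    j_u = max(range(kappa), key=lambda j: grid[j] * d_hat[j])
    j_c = min(range(kappa), key=lambda j: abs(d_hat[j] - x_over_t))
    p_hat = max(grid[j_u], grid[j_c])
    # Shrink asymmetrically around the estimate, then truncate to the
    # feasible price range.
    margin = math.log(n) * width
    new_low = max(feasible[0], p_hat - left_frac * margin)
    new_high = min(feasible[1], p_hat + right_frac * margin)
    return p_hat, new_low, new_high

def noiseless_demand(p):
    # Hypothetical noise-free stand-in for observed demand.
    return 10.0 - 2.0 * p

p_hat, new_low, new_high = learning_iteration(1.0, 4.0, 30, noiseless_demand, 4.0, 1000)
```

With noise-free observations the new interval is centered near the true optimal price and is much narrower than the starting one; in the stochastic setting, the width of the margin is what allows the interval to keep containing the optimal price with high probability.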

Recall that the optimal deterministic price $p^D$ is equal to the maximum of $p^u$ and $p^c$, where
$p^u$ and $p^c$ are solved from (11) and (12), respectively. It turns out that (11) and (12) have quite
different local behaviors around their optimal solutions under our assumptions: (11) resembles a
quadratic function while (12) resembles a linear function. This difference requires us to have
different shrinking strategies for the cases $p^u > p^c$ and $p^c > p^u$. This is why we have two
learning steps (Steps 2 and 3) in our algorithm. Specifically, in Step 2, the algorithm works by
shrinking the price interval until either the transition condition (16) is triggered or the learning
phase is terminated. As will be shown later, when the transition condition (16) is triggered,
we know with high probability that the optimal solution to the deterministic problem is
$p^c$. Otherwise, if we terminate learning before the condition is triggered, we know that $p^u$ is
either the optimal solution to the deterministic problem or close enough to it that using $p^u$
will also give us near-optimal revenue. When the transition condition (16) is triggered, we switch
to Step 3, where we use a new set of shrinking and price testing parameters. Note that in Step
3, we start from the initial price interval rather than the current interval obtained. This is not
necessary, but it eases the analysis. Both Step 2 and Step 3 (if it is invoked) must terminate
in a finite number of iterations (this is proved in a lemma later).
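
The quadratic versus linear local behavior can be made explicit with a short derivation. This is a sketch under two extra assumptions introduced here: the revenue-maximizing rate $\lambda^u = \lambda(p^u)$ is interior, so that $r'(\lambda^u) = 0$, and the rate $x/T$ is attained at $p^c$. The constants $m_L \ge m_U > 0$ are the curvature bounds and $K$ is the Lipschitz factor from Assumption A.

```latex
% Assumed notation: \lambda^u = \lambda(p^u), and \lambda(p^c) = x/T.
\begin{align*}
\frac{m_U}{2}\,(\lambda-\lambda^u)^2 \;\le\; r(\lambda^u)-r(\lambda)
  \;&\le\; \frac{m_L}{2}\,(\lambda-\lambda^u)^2
  && \text{(second-order Taylor expansion with } r'(\lambda^u)=0\text{)},\\
\frac{1}{K}\,|p-p^c| \;\le\; \bigl|\lambda(p)-x/T\bigr|
  \;&\le\; K\,|p-p^c|
  && \text{(Lipschitz continuity of } \gamma \text{ and of } \lambda\text{)}.
\end{align*}
```

The first line says that a pricing error near $p^u$ costs revenue only at second order, while the second says that a pricing error near $p^c$ changes the demand rate, and hence inventory consumption, at first order.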

After this learning-while-doing period ends, a fixed price is used for the remaining selling
season (Step 4) until the inventory runs out. For illustration, a high-level description of
the algorithm is shown below. One thing to note is that the "next" intervals defined in (17)
and (18) are not symmetric. Similarly, in Step 4(a), we use an adjusted price for the remaining
selling season. This adjustment is a device to make sure that the inventory consumption can be
adequately upper bounded, while the adjustment is small enough that the revenue is
maintained. It is based on the different local behaviors of the revenue rate function and the
demand rate function. The detailed reasoning behind this adjustment is given in
Lemma 11.



High-level description of the Dynamic Price Learning Algorithm:
Step 1. Initialization:

(a) Initialize the time length and price granularity for each learning period. Set
the maximum numbers of iterations $N^u$ and $N^c$ in Steps 2 and 3;

Step 2. Learn $p^u$ or determine that $p^c > p^u$:

(a) Set the initial price interval to be $[\underline{p}, \bar{p}]$

(b) Test a grid of prices on the current price interval for a predetermined length of
time, and observe the demand for each price

(c) Compute the empirical optimal $p^c$ and $p^u$ using the observed demand

(d) If $p^c$ is "significantly" greater than $p^u$, then enter Step 3; otherwise, shrink the
current interval to a subinterval containing the empirical optimal $p^u$

(e) Repeat (b)-(d) $N^u$ times and then enter Step 4(a);

Step 3. Learn $p^c$ when $p^c > p^u$:

(a) Set the initial price interval to be $[\underline{p}, \bar{p}]$

(b) Test a grid of prices on the current price interval for a predetermined length of
time, and observe the demand for each price

(c) Compute the empirical optimal $p^c$ using the observed demand

(d) Shrink the current interval to a subinterval containing the empirical optimal $p^c$

(e) Repeat (b)-(d) $N^c$ times and then enter Step 4(b);

Step 4. Apply the learned price:

(a) Apply the last price in Step 2 until the stock runs out or the selling season
finishes

(b) Apply the last price in Step 3 until the stock runs out or the selling season
finishes.



In the following, without loss of generality, we assume $T = 1$ and $\bar{p} - \underline{p} = 1$. Now we define
$\tau_i^u$, $\kappa_i^u$, $N^u$, $\tau_i^c$, $\kappa_i^c$ and $N^c$. We first show a set of relations we want $(\tau_i^u, \kappa_i^u)$ and $(\tau_i^c, \kappa_i^c)$ to
satisfy. Then we explain the meaning of each relation and solve them. Finally, we
prove our main theorem on the asymptotic performance of our algorithm.
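Now we state the set of relations we want $\tau_i^u$ and $\kappa_i^u$ to satisfy:¹

$$\left(\frac{\bar{p}_i^u - \underline{p}_i^u}{\kappa_i^u}\right)^2 \sim \sqrt{\frac{\kappa_i^u}{n\tau_i^u}}, \quad \forall i = 1, \dots, N^u \tag{21}$$

$$\bar{p}_{i+1}^u - \underline{p}_{i+1}^u \sim \log n \cdot \frac{\bar{p}_i^u - \underline{p}_i^u}{\kappa_i^u}, \quad \forall i = 1, \dots, N^u - 1 \tag{22}$$

$$\tau_{i+1}^u \cdot \left(\frac{\bar{p}_i^u - \underline{p}_i^u}{\kappa_i^u}\right)^2 \cdot \sqrt{\log n} \sim \tau_1^u, \quad \forall i = 1, \dots, N^u - 1 \tag{23}$$

Also we define

$$N^u = \min\left\{l : \left(\frac{\bar{p}_l^u - \underline{p}_l^u}{\kappa_l^u}\right)^2 \sqrt{\log n} < \tau_1^u\right\} \tag{24}$$

We then state the set of relations we want $\tau_i^c$ and $\kappa_i^c$ to satisfy:

$$\frac{\bar{p}_i^c - \underline{p}_i^c}{\kappa_i^c} \sim \sqrt{\frac{\kappa_i^c}{n\tau_i^c}}, \quad \forall i = 1, \dots, N^c \tag{25}$$

$$\bar{p}_{i+1}^c - \underline{p}_{i+1}^c \sim \log n \cdot \frac{\bar{p}_i^c - \underline{p}_i^c}{\kappa_i^c}, \quad \forall i = 1, \dots, N^c - 1 \tag{26}$$

$$\tau_{i+1}^c \cdot \frac{\bar{p}_i^c - \underline{p}_i^c}{\kappa_i^c} \cdot \sqrt{\log n} \sim \tau_1^c, \quad \forall i = 1, \dots, N^c - 1 \tag{27}$$

Also we define

$$N^c = \min\left\{l : \frac{\bar{p}_l^c - \underline{p}_l^c}{\kappa_l^c} \sqrt{\log n} < \tau_1^c\right\} \tag{28}$$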

Before we explain these desired relationships, let us first examine where the revenue loss comes 
from in this algorithm. First, in each period, there is a so-called exploration bias, that is, 
the prices tested in each period may deviate from the optimal price, resulting in suboptimal 
revenue rate or suboptimal demand rate. These deviations multiplied by the time length of 
each period will be the "loss" for that period. Second, since we only explore a grid of prices, 
there is also a deterministic error associated with it. Thirdly, since the demand is essentially a 
stochastic process, the observed demand rate may deviate from the true demand rate, resulting 
in a stochastic error. Note that these three errors also exist in the learning algorithm proposed 
in [8]. However, in our dynamic learning case, each error does not simply appear once. For 
example, the deterministic error and stochastic error in one period may have impact on all the 
future periods. Thus, the design of our algorithm will revolve around the idea of balancing these 
errors in each step to achieve the maximum efficiency of learning. With this in mind, we explain 
the meaning of each equation above in the following: 



• The first equation (21) ((25), resp.) balances the deterministic error induced by only
considering the grid points (in which the grid granularity is $\frac{\bar{p}_i^u - \underline{p}_i^u}{\kappa_i^u}$ ($\frac{\bar{p}_i^c - \underline{p}_i^c}{\kappa_i^c}$, resp.)) and the
stochastic error induced in the learning period, which is $\sqrt{\frac{\kappa_i^u}{n\tau_i^u}}$ ($\sqrt{\frac{\kappa_i^c}{n\tau_i^c}}$, resp.). These two
terms determine the price deviation in the next period and thus the exploration error of the
next period. We will show that under our assumptions, the loss due to price granularity
is quadratic in Step 2 and linear in Step 3. We balance these two errors to achieve the
maximum efficiency in learning.



¹Here $f \sim g$ means that $f$ and $g$ are of the same order in $n$.
• The second equation (22) ((26), resp.) is used to make sure that with high probability,
our learning interval $I_i^u$ ($I_i^c$, resp.) contains the optimal price $p^D$. We have to guarantee
that $I_i^u$ ($I_i^c$, resp.) contains $p^D$; otherwise, when we miss the optimal price, we will incur a
constant exploration error in all periods afterwards. This relationship is actually given in
the algorithm (see (17), (18), (19) and (20)). However, we include it here for the sake
of completeness.



• The third equation (23) ((27), resp.) is used to bound the exploration error for each
learning period. This is done by considering the product of the revenue rate deviation
(also the demand rate deviation) and the length of the learning period, which in our case is
$\tau_{i+1}^u \sqrt{\log n} \cdot \left(\frac{\bar{p}_i^u - \underline{p}_i^u}{\kappa_i^u}\right)^2$ ($\tau_{i+1}^c \sqrt{\log n} \cdot \frac{\bar{p}_i^c - \underline{p}_i^c}{\kappa_i^c}$, resp.). We want this loss to be of the same
order for each learning period (and thus equal to the loss in the first learning period, which
is $\tau_1$) to achieve the maximum efficiency of learning.



• The fourth equation (24) ((28), resp.) determines whether the price we obtain is guaranteed
to be close enough to optimal that we can apply it in the remaining selling season.
We show that $\sqrt{\log n} \cdot \left(\frac{\bar{p}_l^u - \underline{p}_l^u}{\kappa_l^u}\right)^2$ ($\sqrt{\log n} \cdot \frac{\bar{p}_l^c - \underline{p}_l^c}{\kappa_l^c}$, resp.) is the revenue rate and demand
rate deviation of price $p_l$. When this is less than $\tau_1$, we can simply apply $p_l$ and the loss
will not exceed the loss of the first learning period.

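Now we solve for $\kappa_i^u$ and $\tau_i^u$ from the relations defined above. Define $\tau_1^u = n^{-\delta} \cdot (\log n)^{3.5}$. One
can solve (21), (22) and (23) and get:

$$\kappa_i^u = n^{\frac{1}{5}(1-\delta)\left(\frac{3}{5}\right)^{i-1}} \cdot \log n, \quad \forall i = 1, 2, \dots, N^u \tag{29}$$

$$\tau_i^u = n^{1 - 2\delta - (1-\delta)\left(\frac{3}{5}\right)^{i-1}} \cdot (\log n)^5, \quad \forall i = 1, 2, \dots, N^u \tag{30}$$

And as a by-product, we have

$$\bar{p}_i^u - \underline{p}_i^u = n^{-\frac{1}{2}(1-\delta)\left(1 - \left(\frac{3}{5}\right)^{i-1}\right)}, \quad \forall i = 1, 2, \dots, N^u \tag{31}$$

Next we do a similar computation for $\kappa_i^c$ and $\tau_i^c$. Define $\tau_1^c = n^{-\delta} \cdot (\log n)^{2.5}$. We have the
following results:

$$\kappa_i^c = n^{\frac{1}{3}(1-\delta)\left(\frac{2}{3}\right)^{i-1}} \cdot \log n, \quad \forall i = 1, 2, \dots, N^c \tag{32}$$

$$\tau_i^c = n^{1 - 2\delta - (1-\delta)\left(\frac{2}{3}\right)^{i-1}} \cdot (\log n)^3, \quad \forall i = 1, 2, \dots, N^c \tag{33}$$

and

$$\bar{p}_i^c - \underline{p}_i^c = n^{-(1-\delta)\left(1 - \left(\frac{2}{3}\right)^{i-1}\right)}, \quad \forall i = 1, \dots, N^c \tag{34}$$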
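
As a sanity check on the algebra, the following sketch verifies with exact rational
arithmetic that a candidate closed form satisfies the balance relations at the level of
powers of $n$ (logarithmic factors are ignored, as the $\sim$ notation permits). It assumes the
exponents are those of $\kappa_i \approx n^{c(1-\delta)\rho^{i-1}}$, $\tau_i \approx n^{1-2\delta-(1-\delta)\rho^{i-1}}$ and
interval length $n^{-w(1-\delta)(1-\rho^{i-1})}$, with $(c, \rho, w) = (1/5, 3/5, 1/2)$ in Step 2 and
$(1/3, 2/3, 1)$ in Step 3; the function names are ours and purely illustrative.

```python
from fractions import Fraction as F

def exponents(delta, i, c, rho, w):
    """Powers of n for (kappa_i, tau_i, interval length) in iteration i."""
    r = rho ** (i - 1)
    kappa = c * (1 - delta) * r
    tau = 1 - 2 * delta - (1 - delta) * r
    width = -w * (1 - delta) * (1 - r)
    return kappa, tau, width

def check_relations(delta, iterations=6):
    """Check the three balance relations for Step 2 (power 2) and Step 3 (power 1)."""
    for power, c, rho, w in ((2, F(1, 5), F(3, 5), F(1, 2)),
                             (1, F(1, 3), F(2, 3), F(1))):
        for i in range(1, iterations):
            k, t, wd = exponents(delta, i, c, rho, w)
            _, t_next, wd_next = exponents(delta, i + 1, c, rho, w)
            grid = wd - k  # exponent of the grid granularity
            # granularity error balances stochastic error sqrt(kappa / (n tau))
            assert power * grid == (k - 1 - t) / 2
            # next interval length is of the order of the granularity
            assert wd_next == grid
            # exploration loss of the next period equals that of the first
            assert t_next + power * grid == -delta
    return True

ok = check_relations(F(49, 100))
first_step2 = exponents(F(1, 2), 1, F(1, 5), F(3, 5), F(1, 2))
```

Using `Fraction` keeps every comparison exact, so a failed assertion would point to a genuine
mismatch in the exponents rather than to rounding.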



5 Proof Outlines for Theorem 1

In this section, we give an outline of the proof of Theorem 1. We put most of the detailed proofs
in the Appendix; only the main steps and ideas are presented in this section.

As the first step of our proof, we show that given $\delta < 1/2$, our algorithm will stop within a
finite number of iterations. Moreover, the number of iterations is uniformly bounded. We have the
following lemma:



Lemma 1. Fix $\delta < 1/2$. $N^u$ and $N^c$ defined in (24) and (28) exist. Moreover, there exists an
$N_\delta$ independent of $n$ such that $N^u$ and $N^c$ are both bounded by $N_\delta$.

Proof. See Appendix 10.3. □

Although Lemma 1 is simple, it is important since it allows us to treat the number of
iterations of our algorithm as constant. In much of our analysis, we frequently need to take a
union bound over the number of iterations, and Lemma 1 justifies such analysis. In our
algorithm, it is important to make sure that the deterministic optimal price $p^D$ is always
contained in our price interval. This is because when we miss the deterministic optimal price,
we will incur a constant loss for all periods afterwards, and thus the algorithm cannot achieve
asymptotic optimality. The next lemma shows that our algorithm indeed has this property.

Lemma 2. Assume $p^D$ is the optimal price for the deterministic problem and Assumptions A
and B hold for $\Gamma = \Gamma(M, K, m_L, m_U)$ and $\epsilon > 0$. Then with probability $1 - O\left(\frac{1}{n}\right)$,

• If we never enter Step 3, then $p^D \in I_i^u$ for all $i = 1, 2, \dots, N^u$

• If Step 2 stops at $i_0$ and the algorithm enters Step 3, then $p^D \in I_i^u$ for all $i = 1, 2, \dots, i_0$
and $p^D \in I_j^c$ for all $j = 1, 2, \dots, N^c$

Proof. Here we give a sketch of the proof for the first part of this lemma. The detailed proof is
given in Appendix 10.4.

We prove by induction on $i$. Assume $p^D \in I_i^u$. We consider the $(i+1)$th iteration. Define

$$\gamma_i = \log n \cdot \max\left\{\left(\frac{\bar{p}_i^u - \underline{p}_i^u}{\kappa_i^u}\right)^2, \sqrt{\frac{\kappa_i^u}{n\tau_i^u}}\right\}. \tag{35}$$

Denote the unconstrained and constrained optimal solutions on the current interval by $p_i^u$
and $p_i^c$. We can show (the details are in Appendix 10.4) that with probability $1 - O\left(\frac{1}{n}\right)$,
$|\hat{p}_i^u - p_i^u| < C\sqrt{\gamma_i}$ and $|\hat{p}_i^c - p_i^c| < C\sqrt{\gamma_i}$ (in our analysis, for simplicity, we use $C$ to denote a
generic constant; the relations between its occurrences may not be specified). Therefore, with
probability $1 - O\left(\frac{1}{n}\right)$, $|\hat{p}_i - p^D| < C\sqrt{\gamma_i}$. On the other hand, the length of the next price
interval (whose center is near $\hat{p}_i$) is of order $\sqrt{\log n}$ greater than $\sqrt{\gamma_i}$. Therefore, with
probability $1 - O\left(\frac{1}{n}\right)$, $p^D \in I_{i+1}^u$. Then we take a union bound over all $i$ and the first part of
the lemma holds. □



Next we show that if condition (16) is triggered, then with probability $1 - O\left(\frac{1}{n}\right)$, $p^c > p^u$.
Equivalently, if $p^u \ge p^c$, then with probability $1 - O\left(\frac{1}{n}\right)$, condition (16) will
not be triggered. We will use this fact many times in the sequel, so we formalize it in the
following lemma.

Lemma 3. If $p^u \ge p^c$, then with probability $1 - O\left(\frac{1}{n}\right)$, our algorithm will not enter Step 3
before stopping.

Proof. The proof of this lemma follows from the proof of Part 2 in Lemma 2. See Appendix
10.4. □

Remark. Lemma 3 says that if $p^u \ge p^c$, then with high probability our algorithm will not enter
Step 3. When $p^c > p^u$, however, it is also possible that our algorithm will not enter Step 3, but
as we will show later, in that case $p^u$ must be very close to $p^c$, so that the revenue collected
is still near-optimal.

Now we have proved that with high probability, $p^D$ will always be in our price interval. Next
we analyze the revenue collected by this algorithm and thereby prove our main theorem.

We first prove the case when $p^u \ge p^c$. We prove:

Proposition 1. When $p^u \ge p^c$, $\sup_{\lambda \in \Gamma} R_n^\pi(x, T; \lambda) \le \frac{C(\log n)^{3.5}}{n^{\delta}}$.

Consider a problem with size $n$. Recall that $J_n^\pi(x, T; \lambda)$ is the expected revenue collected by
our algorithm given that the underlying demand function is $\lambda$. Define $Y_{ij}^u$ to be Poisson random
variables with parameters $\lambda(p_{i,j}^u) n \Delta_i^u$ ($Y_{ij}^u = N(\lambda(p_{i,j}^u) n \Delta_i^u)$).² Also define $\tilde{Y}^u$ to be a Poisson
random variable with parameter $\lambda(\tilde{p}) n (1 - t_{N^u}^u)$ ($\tilde{Y}^u = N(\lambda(\tilde{p}) n (1 - t_{N^u}^u))$).

²We remove the dependence on $n$ in the notation. If not otherwise stated, it is assumed that we
are talking about a problem with size $n$.
We define the following events ($I(\cdot)$ denotes the indicator function of a certain event):

$A_2 = \{\omega : \text{The algorithm never enters Step 3 and } p^D \in I_i^u,\ \forall i = 1, 2, \dots, N^u\}$.
We have 

j:{x,T;X) > + i?[pmin(y",(nx-^y,;;)+)/(A5')/(A^)]. (36) 

i=l j=l ij 



In the following, we consider each term in (36). We will show that the revenue collected
in both parts is "close" to the revenue generated by the optimal deterministic price on that
part (and the consumed inventory is also near-optimal). We first have:

Lemma 4. 

TV" ft" N" 
1=1 j = l i=l 

Proof. This proof analyzes the exploration error in each iteration. Under $A_2$, the exploration
error is quadratic in the length of the price interval of each period. Summing up those
errors leads to this result. The details of the proof are given in Appendix 10.5. □



Now we look at the other term in (36). We have



E[pmmiY-, (nx - ^ r,p+)/(A«)/(A«)] 
= E[p{Y^ - niax(f" - (nx ~J2y,'^)+,0))I{A^)I{A^) 
> E[pY-I{A^)I{Ali)] - E[p{Y- + E - "^)^] 



(38) 



For the first term, we have 



E[pY-I{A^I)IiA^)] 
^ E[n{l-tl^)p\{p)I{A-)I{Al)] 

> (1-0 (^) )n{l tl^)E[pX{p)\I{A-)I{A^,)] 



(39) 



However, by our assumption on the bound of the second derivative of $r(\lambda)$, and by (23) and
(24), we know that

E[pX{p)\I{A-)I{Al^)] > p^X{p^) - C{pl.+, > P^'Hp'') - Ct-. (40) 

Therefore, 

E[pY^^I{A-,)I{Al^)] > p^'Xip^) ■ n{l - t^J - Cnr^. (41) 

Now we consider 

E[p{r^ + J2y^'; - nx)+]. (42) 



15 



First we relax this to 

+ (43) 

We have the following lemmas: 
Lemma 5. 

+ ^ Yi"^ - £;y" - ^ < Cnri", (44) 

where C is a properly chosen constant. 
Lemma 6. //p" > p"^ , then 

^ 2 + EY"" ~nx< Cnr'i , (45) 
where C is a properly chosen constant. 

Proof of Lemmas 5 and 6. The proof of Lemma 5 repeatedly uses Lemma 14 in Appendix 10.2,
which bounds the tail values of Poisson random variables. The proof of Lemma 6 bounds the
inventory consumed in the learning period. We show that by the way we define each learning
interval in (17) and (18), with high probability, $p^D$ is always to the left of the center of the
price interval. Therefore, the excess inventory consumed in each iteration (compared to the
consumption under $p^D$) is at most a second-order quantity in the length of the price range (the
first-order error has been canceled out, or a negative value remains). This important fact
gives us the desired bound on the consumption of the inventory. The detailed proofs of these
two lemmas are given in Appendix 10.6. □

Now we combine Lemmas 4, 5 and 6. We have

$$R_n^\pi(x, T; \lambda) \le 1 - \frac{n p^D \lambda(p^D) - C n \tau_1^u}{n p^D \lambda(p^D)}. \tag{46}$$

Therefore, Proposition 1 follows. Next we consider the case when $p^c > p^u$. We claim
Proposition 2. When p'' > p"^ , sup;,gr Ki^, T; A) < '^^'°f/^'"° . 

When p"^ > p", we condition our analysis on the time the algorithm enters Step 3. We define 
Y^ and as before, and Yfj to be Poisson random variables with parameters A(j5^j)nAf (Yfj = 
N {X{p^ j)nA'^)) . Also define Yf to be a Poisson random variable with parameter A(g)n(l — — 

i«) (f- = 7V(A(g)n(l-t^.-<r))). 
We define the following events:

$$B_1 = \{i_0 = 1\}, \quad B_2 = \{i_0 = 2\}, \quad \dots, \quad B_{N^u} = \{i_0 = N^u\}, \tag{47}$$

$$B_{N^u+1} = \{\text{The algorithm doesn't enter Step 3}\}.$$

Then we have the following bound on $J_n^\pi$ in this case:

Jl{x, T, A) > Ep".^^"^(uS+'SO] + £;[pmin(y", {nx - ^ K;)+)/(i?^.+i)] 

1=1 j = l i,j 
2—1 j — 1 l=X i—1 j — 1 i—1 J — 1 
We will get a bound on each term. We prove the following lemmas:
Lemma 7. 

N" k" AT" 

Ep".^«"^(U^"+ii?0] > E nnp''X{p^)P{uf:+'B,) Cnrr. (49) 

z— 1 j — 1 i—1 

Lemma 8. 

^[pmin(f", {nx - ^ > p^A(p^) • n{l - tl^)P{BN^.+i) - Cnr^. (50) 

Lemma 9. 

J2 E E[pIjY:;,H^I\Bi)] > nT^p^X{p^)P{ul\Bi) - Cnrf. (51) 

i=l j = l i=l 

Lemma 10. For eac/i I = 1, A''", 

^;[gmin(f,^ {nx -J2Y.^^'^ ^^TWi)] > np^X{p^)il - ^ t^^.)P{B,) ~ Cnr^. 

i—1 j—1 i—1 j — 1 

(52) 

Proof. The proofs of the above lemmas resemble the proofs for the case when $p^u \ge p^c$. They
are given in Appendix 10.7.

We then combine Lemmas 7, 8, 9 and 10, adding the right-hand sides together, and Proposition
2 follows. Theorem 1 thus follows from Propositions 1 and 2.



6 Lower Bound Example 

In this section, we prove Theorem 2. We show that there exists a class of demand functions
satisfying our assumptions for which no pricing policy can achieve an asymptotic regret of
lower order than $1/\sqrt{n}$.

The proof involves statistical bounds on hypothesis testing, and it resembles the example
discussed in [23] and [9]. However, since our model is different from those in [23] and [9], the
proof is different in many ways. We will discuss this at the end of this section.

Proposition 3. Define a set of demand functions as follows. Let $\lambda(p; z) = 1/2 + z - zp$, where
$z$ is a parameter taking values in $Z = [1/3, 2/3]$ (we denote this demand function set by $\Lambda$).
Assume that $\underline{p} = 1/2$ and $\bar{p} = 3/2$. Also assume that $x = 2$ and $T = 1$. Then we have

• This class of demand functions satisfies Assumption A. Furthermore, for any $z \in [1/3, 2/3]$,
the optimal price $p^D$ always equals $p^u$ and $p^D \in [7/8, 5/4]$. Therefore, it also satisfies
Assumption B with $\epsilon = 1/4$.

• For any admissible pricing policy $\pi$,

$$\sup_{z \in Z} R_n^\pi(x, T; z) \ge \frac{1}{3(48)^2 \sqrt{n}}, \quad \forall n. \tag{53}$$

First, we explain some intuition behind this example. Note that all the demand functions
in $\Lambda$ cross at one common point, that is, when $p = 1$, $\lambda(p; z) = 1/2$. Such a price is called an
"uninformative" price in [23]. When there exists an "uninformative" price, experimenting at
that price will not gain information about the demand function. Therefore, in order to "learn"
the demand function (i.e., the parameter $z$) and determine the optimal price, one must at least
perform some price experiments at prices away from the uninformative price; on the other hand,
when the optimal price is indeed the uninformative price, price experimentation away from the
optimal price will incur some revenue loss. This tension between exploration and exploitation
is the key reason for such a lower bound on the loss. Before we proceed, we list some general
properties of the demand function set defined in Proposition 3.

Lemma 11. For the demand functions defined in Proposition 3, denote the optimal price $p^D$
under parameter $z$ by $p^D(z)$. We have:

1. $p^D(z) = (1 + 2z)/(4z)$

2. $p^D(z_0) = 1$ for $z_0 = 1/2$

3. $\lambda(p^D(z_0); z) = 1/2$ for all $z$

4. $-4/3 \le r''(p; z) \le -2/3$ for all $p, z$

5. |p^(z)-p^(zo)|>^ 

6. $p^D(z) = p^u(z)$ for all $z$.
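
Several of these properties are easy to confirm numerically. The sketch below checks, on a
grid of parameters $z \in [1/3, 2/3]$, that $(1 + 2z)/(4z)$ is a local maximizer of the revenue
rate $p\lambda(p; z)$, that it lies in $[7/8, 5/4]$, that every demand curve passes through
$\lambda = 1/2$ at $p = 1$, and that the demand rate never exceeds $x/T = 2$ on $[1/2, 3/2]$, so
the inventory constraint is not binding. The function names are ours; the inputs are the
quantities stated in Proposition 3.

```python
def demand(p, z):
    """Demand rate lambda(p; z) = 1/2 + z - z p."""
    return 0.5 + z - z * p

def revenue(p, z):
    """Revenue rate p * lambda(p; z)."""
    return p * demand(p, z)

def p_opt(z):
    """Candidate optimal price (1 + 2z) / (4z)."""
    return (1 + 2 * z) / (4 * z)

def check_family(num_points=101, x_over_T=2.0, h=1e-4, tol=1e-9):
    for k in range(num_points):
        z = 1 / 3 + (1 / 3) * k / (num_points - 1)
        p = p_opt(z)
        # stationary point of a concave quadratic: no nearby price does better
        assert revenue(p, z) >= revenue(p - h, z)
        assert revenue(p, z) >= revenue(p + h, z)
        # the optimal price stays in [7/8, 5/4]
        assert 7 / 8 - tol <= p <= 5 / 4 + tol
        # all demand curves cross at the uninformative price p = 1
        assert abs(demand(1.0, z) - 0.5) < tol
        # demand is decreasing in p, so its maximum on [1/2, 3/2] is at p = 1/2
        assert demand(0.5, z) <= x_over_T
    return True

family_ok = check_family()
```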

Now in order to quantify the tension mentioned above, we need a notion of "uncertainty" 
about the unknown demand parameter z. For this, we use the K-L divergence over two proba- 
bility measures for a stochastic process. 

For any policy $\pi$ and parameter $z$, let $P_z^\pi$ denote the probability measure associated with
the observations (the process observed when using policy $\pi$) when the true demand function is
$\lambda(p; z)$. We also denote the corresponding expectation operator by $E_z^\pi$.

Given $z$ and $z_0$, the Kullback-Leibler (K-L) divergence between the two measures $P_{z_0}^\pi$ and
$P_z^\pi$ over time 0 to $T$ is given by the following (we refer to [11] for this definition):



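$$\mathcal{K}(P_{z_0}^\pi, P_z^\pi) = E_{z_0}^\pi\left[\int_0^T n\lambda(p(s); z)\left[\frac{\lambda(p(s); z_0)}{\lambda(p(s); z)} \log\frac{\lambda(p(s); z_0)}{\lambda(p(s); z)} + 1 - \frac{\lambda(p(s); z_0)}{\lambda(p(s); z)}\right] ds\right]$$

$$= E_{z_0}^\pi\left[\int_0^T n\left\{\left(\tfrac{1}{2} + z_0 - z_0 p(s)\right) \log\frac{\tfrac{1}{2} + z_0 - z_0 p(s)}{\tfrac{1}{2} + z - z p(s)} - \left(\tfrac{1}{2} + z_0 - z_0 p(s)\right) + \left(\tfrac{1}{2} + z - z p(s)\right)\right\} ds\right]. \tag{54}$$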



Note that the K-L divergence is a measure of distinguishability between probability measures: 
if two probability measures are close, then they have a small K-L divergence and vice versa. In 
terms of pricing policies, a pricing policy $\pi$ is more likely to distinguish between the case when
the parameter is $z$ and the case when the parameter is $z_0$ if the quantity $\mathcal{K}(P_{z_0}^\pi, P_z^\pi)$ is large.
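
To make the role of the uninformative price concrete, the sketch below evaluates the K-L
divergence for a policy that charges one constant price $p$ throughout. It assumes the standard
expression for two Poisson processes with constant intensities $n\lambda_0$ and $n\lambda$ on
$[0, T]$, namely $nT(\lambda_0 \log(\lambda_0/\lambda) - \lambda_0 + \lambda)$; the function name and the
values $n = 1000$, $z = 0.6$ are illustrative choices of ours.

```python
import math

def demand(p, z):
    """Demand rate lambda(p; z) = 1/2 + z - z p."""
    return 0.5 + z - z * p

def kl_constant_price(p, z0, z, n, T=1.0):
    """K-L divergence between the sales processes under z0 and z at a fixed price p."""
    lam0, lam = demand(p, z0), demand(p, z)
    return n * T * (lam0 * math.log(lam0 / lam) - lam0 + lam)

# At p = 1 both demand curves give rate 1/2, so nothing can be learned.
kl_uninformative = kl_constant_price(1.0, 0.5, 0.6, n=1000)
# At p = 3/2 the rates are 0.25 and 0.20, and the divergence is strictly positive.
kl_informative = kl_constant_price(1.5, 0.5, 0.6, n=1000)
```

Pricing at $p = 1$ is revenue-optimal under $z_0 = 1/2$ but yields zero divergence, whereas
moving away from it buys information at the cost of revenue.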

Now we show the following lemma, which gives a lower bound of the regret induced by 
any policy in terms of the K-L divergence; this means a pricing policy that is better able to 
distinguish different parameters will also be more costly. 

Lemma 12. For any $z \in Z$ and any policy $\pi$ setting prices in $\mathcal{P}$,

$$\mathcal{K}(P_{z_0}^\pi, P_z^\pi) \le 24 n (z_0 - z)^2 R_n^\pi(x, T; z_0), \tag{55}$$

where $z_0 = 1/2$ and $R_n^\pi(x, T; z_0)$ is the regret function defined earlier with $\lambda$ being $\lambda(p; z_0)$.

Proof. The proof bounds the final term in (54) and is given in Appendix 10.8. □



Now we have shown that in order to have a policy that is able to distinguish between two 
different parameters, one has to give up some portion of the revenue. In the following lemma, we 
show that on the other hand, if a policy is not able to distinguish between two close parameters, 
then it will also incur a loss: 
Lemma 13. Let $\pi$ be any pricing policy that sets prices in $[\underline{p}, \bar{p}]$ and $p_\infty$. Define $z_0 = 1/2$ and
$z_1^n = z_0 + \frac{1}{4n^{1/4}}$ (note $z_1^n \in [1/3, 2/3]$ for all $n \ge 2$). We have for any $n \ge 2$,

$$R_n^\pi(x, T; z_0) + R_n^\pi(x, T; z_1^n) \ge \frac{1}{3(48)^2 \sqrt{n}}\, e^{-\mathcal{K}(P_{z_0}^\pi, P_{z_1^n}^\pi)}. \tag{56}$$



Proof. The proof uses ideas similar to those discussed in [9] and [23]. Here we give a sketch of
the proof. We define two non-intersecting intervals around $p^D(z_0)$ and $p^D(z_1^n)$. We show that
when the true parameter is $z_0$, pricing using $p$ in the second interval will incur a certain loss,
and the same order of loss will be incurred if we use $p$ in the first interval when the true
parameter is $z_1^n$. At each time, we treat our policy $\pi$ as a hypothesis test engine that maps
the historical data into two actions:

• Choose a price in the first interval

• Choose a price outside the first interval

Then we can represent the revenue loss during the selling season by the "accumulated
probability" of committing errors in those hypothesis tests. However, by the theory of
hypothesis testing, one can lower bound the probability of errors for any decision rule. Thus
we can obtain a lower bound on the revenue loss for any pricing policy. The complete proof is
given in the Appendix. □



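Now we combine Lemmas 12 and 13. Picking $z$ in Lemma 12 to be $z_1^n$ and adding (55) and
(56) together, we have:

$$2\left(R_n^\pi(x, T; z_0) + R_n^\pi(x, T; z_1^n)\right) \ge \frac{2}{3\sqrt{n}}\, \mathcal{K}(P_{z_0}^\pi, P_{z_1^n}^\pi) + \frac{1}{3(48)^2 \sqrt{n}}\, e^{-\mathcal{K}(P_{z_0}^\pi, P_{z_1^n}^\pi)}$$

$$\ge \frac{1}{3(48)^2 \sqrt{n}} \left(\mathcal{K}(P_{z_0}^\pi, P_{z_1^n}^\pi) + e^{-\mathcal{K}(P_{z_0}^\pi, P_{z_1^n}^\pi)}\right)$$

$$\ge \frac{1}{3(48)^2 \sqrt{n}}.$$

The last inequality holds because for any number $w \ge 0$, $w + e^{-w} \ge 1$. Therefore, we have shown
that for any $n$, no matter what policy is used, we always have

$$\sup_{z \in Z} R_n^\pi(x, T; z) \ge \frac{1}{3(48)^2 \sqrt{n}}$$

and Proposition 3 is proved.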



Remark. Our proof is similar to the proofs of the corresponding worst-case examples in [8]
and [23], but different in several ways. First, in [8], only finitely many possible prices are
considered (though their proof is for a high-dimensional case, for the sake of comparison, here
we compare our proposition with theirs in the one-dimensional case). In our case, a continuous
interval of prices is allowed. Therefore, the set of admissible policies in our case is much
larger, and the K-L divergence function is slightly more sophisticated than the one used in
their proof. In fact, the structure of our proof more closely resembles the one in [23], where
they consider a worst-case example for a general parametric choice model. However, in their
model, time is discrete. Therefore, a discrete version of the K-L divergence is used and the
analysis is based on the sum of the errors of different steps. In some sense, our analysis can
be viewed as a continuous-time extension of the proof in [23].



7 Numerical Results 

In this section, we perform numerical tests of the dynamic pricing algorithm discussed in previous
sections. Specifically, we compare the results of our dynamic price learning algorithm to
those of the algorithm proposed in [8].

In our numerical tests, we consider two underlying demand functions. One is linear with
$\lambda_1(p) = 30 - 3p$ and the other is exponential with $\lambda_2(p) = 80e^{-0.5p}$. These two demand functions
are in accordance with the demand functions chosen in the numerical tests in [8], where they
considered $\lambda_1 = 30 - 3p$ and $\lambda_2 = 10e^{-0.5p}$. The reason that we change the constant in $\lambda_2(p)$ is
that we want to examine two different cases for our algorithm, one with $p^u > p^c$ and the other
with $p^c > p^u$. Note that with underlying demand function $\lambda_1$, we have $p^D = p^u = 5 > p^c = 10/3$,
and with underlying demand function $\lambda_2$, we have $p^D = p^c = 2\ln 4 > p^u = 2$. In both cases,
we assume the initial inventory level is $x = 20$, the selling horizon is $T = 1$, and the initial price
interval is $[\underline{p}, \bar{p}] = [0.1, 10]$. For each case, we run $10^3$ independent simulations, comparing
their average to the deterministic optimal solution (the standard error for the cases $n = 10$
and $n = 100$ is less than 0.4% of the mean, and the standard errors for the remaining cases are
less than 0.01% of the mean). Note that the above settings are exactly the same as those in [8].
We also make the following modifications to our algorithm in implementations:

• We remove the $\log n$ factor in $\tau_i^u$ and $\tau_i^c$ in our numerical study. Otherwise the factors
$(\log n)^5$ and $(\log n)^3$ in (30) and (33) are too big for the cases we study. Since the $\log n$
factors are mainly used for analysis purposes, this modification is quite reasonable. In
fact, this modification leads to better performance in revenue in the cases we study.

• Whenever our algorithm enters Step 3, instead of using $[\underline{p}, \bar{p}]$ as the initial interval as
stated in our algorithm, we use $[\underline{p}_{i_0}^u, \bar{p}_{i_0}^u]$, which is the last computed price interval. As
we showed in Lemma 2, with high probability this interval contains the deterministic
optimal price; therefore, intuitively, this will also guarantee the asymptotic behavior we
have (although we restart the process in our stated algorithm for ease of analysis). This
also improves the performance of our algorithm somewhat.

Before we show our comparison results, we first use an example to show how our algorithm
actually works, that is, what the time length of each step is and how the price interval evolves.
We take the linear demand function case with $\lambda_1(p) = 30 - 3p$ and $n = 10^5$ as an example. A
sketch of a single run of our algorithm in this case is shown in Figure 1.

In Figure 1 we can see that our algorithm runs 4 iterations of price learning before entering
the last step. The time spent in each iteration is increasing, which is in accordance with
our definition of $\tau_i^u$ and is also intuitive: since we are using a more accurate price for
experimentation in each iteration, we can afford to spend more time without incurring extra
losses. In addition, the number of prices tested in each interval is decreasing, along with the
length of the price interval and the price granularity. Therefore, as time evolves, we test
fewer prices on each interval, on finer grids, and test each price for longer. And finally, when
r" > 1, we apply the last learned price in the rest of the time horizon. 
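
The period lengths reported in Figure 1 can be reproduced from the Step 2 schedule once the
logarithmic factors are dropped. The sketch below assumes the closed form
$\tau_i^u = n^{1 - 2\delta - (1-\delta)(3/5)^{i-1}}$ with $n = 10^5$ and $\delta = 0.49$; the function name is
ours and the formula is taken as an assumption rather than as the implementation used for the
experiments.

```python
def step2_period_lengths(n, delta, iterations):
    """Learning-period lengths tau_i for i = 1..iterations, log factors dropped."""
    return [n ** (1 - 2 * delta - (1 - delta) * (3 / 5) ** (i - 1))
            for i in range(1, iterations + 1)]

taus = step2_period_lengths(n=10**5, delta=0.49, iterations=5)
time_after_four = sum(taus[:4])
time_after_five = sum(taus)
```

Under these assumptions the first four lengths round to 0.0035, 0.037, 0.15 and 0.35, matching
the values shown for the four iterations in Figure 1; they sum to about 0.55, and a fifth
period would push the total past the horizon $T = 1$.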

Recall that we evaluate our algorithm by the regret function $R_n^\pi(x, T; \lambda)$ defined earlier.
In Theorem 1 we showed that asymptotically, the regret is of order $n^{-\delta}$ (with a logarithmic
factor) when we choose any fixed $\delta < 1/2$. In other words, $\log(R_n^\pi(x, T; \lambda))$ should be
approximately a linear function of $\log n$ with slope $-\delta$. In the following, we choose $\delta = 0.49$,
conduct numerical experiments for problems with different sizes $n$, and study how $R_n^\pi$ changes
with $n$. Specifically, we use a linear regression to fit the relationship between $\log R_n^\pi(x, T; \lambda)$
and $\log n$. The results are shown in Figures 2(a) and 2(b).
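
The slope estimate itself is an ordinary least-squares fit in log-log coordinates. The sketch
below shows the computation on synthetic regrets generated exactly as $2n^{-0.49}$; these numbers
are made up for illustration and are not the simulation results reported in this section, and
the function name is ours.

```python
import math

def loglog_slope(ns, regrets):
    """Least-squares slope of log(regret) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(r) for r in regrets]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

ns = [10**k for k in range(2, 7)]
synthetic_regrets = [2.0 * n ** -0.49 for n in ns]  # synthetic, not measured
slope = loglog_slope(ns, synthetic_regrets)
```

On exact power-law data the fit returns the exponent; with simulated regrets, the deviation of
the fitted slope from $-\delta$ reflects noise and the lower-order factors discussed in the text.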



In Figures 2(a) and 2(b), the slopes of the best linear fit of the log-regret against $\log n$ are
approximately 0.444 and 0.465, respectively. Although these are somewhat less than the stated
$\delta = 0.49$, they are significantly larger than the slope of 0.25 obtained in [8] for the
non-parametric policy, and even the slope of 0.33 obtained for the parametric policy in [8]. The
deviation from $\delta$, however, may be due to the omission of the log factor, which is not
insignificant in the cases we study. Besides the asymptotic behavior, we also compare our
regrets to those obtained in [8]; the comparison is shown in Figure 3.
[Figure 1: timeline of a single run, with four learning iterations followed by a fixed price]

Iteration   $\tau_i$   price range   $p^u$   $p^c$
1           0.0035     0.1-10        5.05    3.4
2           0.037      3.99-5.69     5.21    3.99
3           0.15       4.90-5.35     5.07    4.90
4           0.35       4.96-5.18     5.01    4.96
Final step: apply price 5.01 until the end.

Figure 1: Time and price evolvement of our algorithm with $\lambda(p) = 30 - 3p$ and
deterministic optimal price $p^D = p^u = 5$
[Figure 2: log-log plots of regret versus $n$, for $n$ from 1.E+02 to 1.E+06]

(a) Regret vs $n$ with $\lambda_1 = 30 - 3p$

(b) Regret vs $n$ with $\lambda_2 = 80e^{-0.5p}$

Figure 2: Numerical results for the dynamic pricing algorithm. Diamonds show the
performance of our algorithm and the solid line passing through the points is the best
linear fit to those points
[Figure 3: log-log plot of regret (1.E-03 to 1.E+00) versus $n$ (1.E+02 to 1.E+06)]

Figure 3: Comparison between our dynamic algorithm and the non-parametric/parametric
policy in [8]. The solid line is the performance of our dynamic pricing policy, the dashed
line is the non-parametric policy in [8] and the dotted line is the parametric policy in [8]



As we can see in Figure 3, the regret obtained by our algorithm is well below the one
obtained by the non-parametric policy. This means that by using a dynamic learning strategy, 
we can indeed improve the performance quite significantly compared to the policy when we only 
learn the price in one period. It can be seen that when n achieves 10^, the performance of our 
algorithm also surpasses the parametric policy (with one learning period). Also note that the 
regrets obtained by our algorithm have larger deviation from the linear regression model than 
the one shown in [5]. This may be because when we use multiple period learning, the different 
particularities of each individual problem (e.g., where the grid points are positioned) have a 
larger impact on the results, since it might affect the positions of the later price intervals. 



8 Extension 

8.1 Other Applications 

As we mentioned in the beginning, our work can be applied to a general class of single-product 
revenue management problems. In this section, we shed some light on the potential applications. 
We start with a case where the advertisement intensity is the decision variable. 

Consider a company selling a single product over a finite time horizon. Due to market
competition, the price $p$ of the product is fixed. However, the company can choose its
advertisement strategy to affect the demand rate. Assume the firm can choose an advertisement
intensity parameter $a$; for example, in the online selling case, $a$ may be the pay-per-click
price the company pays to search engines. The demand rate under advertisement intensity $a$ is
denoted by $\Lambda(a)$. In this setting, the company controls $a_t$ and the revenue collected is:

$$\int_{s=0}^{T} (p - a_s)\, dN^s \tag{57}$$

where $N^s = N\left(\int_0^s \Lambda(a_t)\, dt\right)$ is a Poisson random variable. Consider the following
transformation:

$$w_t = p - a_t, \qquad \lambda(w_t) = \Lambda(a_t).$$

Then this problem has the same form as the optimal pricing problem we discussed throughout
the paper. Therefore, when the demand function satisfies the same set of conditions, our
theorem applies.

Besides advertisement intensity, the control variable can be viewed as the sales person com- 
pensation or other incentives of selling a product, as long as a similar formulation can be 
established. We believe there are more examples in practice that fit into this model. 



8.2 When the second derivative assumption is not satisfied 

In Assumption A, we assumed that $r''(\lambda)$ exists and is bounded away from zero. This assumption
is necessary in our analysis (at least locally at $p^u$) since we utilize the local quadratic behavior
of the revenue function. However, there are a few cases in practice in which the demand function
does not satisfy this assumption, e.g., when the demand function is piecewise linear and the
revenue-maximizing price $p^u$ is exactly at one of the "kink" points of the piecewise function. In
that case, at $\lambda(p^u)$, $r(\lambda)$ behaves more like a linear function. A natural question is: can we still
achieve the same asymptotic behavior in those cases? The following theorem gives an affirmative
answer to this question, although it requires us to use another algorithm (Algorithm DPA2, see
Appendix 10.10) to achieve this.



Theorem 3. Let Assumption A hold except for the third requirement. Let Assumption B hold
for a fixed $\epsilon > 0$. Also, assume that for any $p \in [\underline{p}, \bar{p}]$, $L|p - p^u| \le |r(\lambda(p)) - r(\lambda(p^u))|$. Then
for any $\delta < 1/2$, there exists a policy $\pi_\delta \in \mathcal{P}$ generated by Algorithm DPA2, such that for all
$n \ge 1$,

sup Rl' {x, T; A) < (53) 

for some constant C . 

Theorem 3 complements our main theorem in some cases when the demand function is not
differentiable. In fact, in this case, a simpler learning algorithm (see Algorithm DPA2) involving
only one learning step would work. However, to apply this theorem, one requires advance
knowledge that the demand function has a "kink" at the optimum. How to combine this case with
the case in our main theorem is left for future work.
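
To make the "kink" condition concrete, the following sketch checks the linear-growth assumption of Theorem 3, $L|p - p^u| \le |r(\lambda(p)) - r(\lambda(p^u))|$, on a small example. The piecewise-linear demand function, the price range $[0, 6]$, and all function names are hypothetical choices made for illustration only; they do not come from the paper.

```python
def kinked_demand(price):
    """Hypothetical piecewise-linear demand rate with a kink at price 4."""
    if price <= 4.0:
        return 10.0 - price
    return 18.0 - 3.0 * price


def revenue(price):
    """Revenue rate p * lambda(p) for the hypothetical demand above."""
    return price * kinked_demand(price)


def linear_growth_holds(slope, p_star, prices, tol=1e-9):
    """Check slope * |p - p_star| <= |r(p) - r(p_star)| on a price grid."""
    r_star = revenue(p_star)
    return all(
        abs(revenue(p) - r_star) >= slope * abs(p - p_star) - tol
        for p in prices
    )


# Price grid on [0, 6] with step 0.01.
prices = [i / 100.0 for i in range(0, 601)]

# The revenue-maximizing grid price sits exactly at the kink.
best_price = max(prices, key=revenue)

print(best_price, revenue(best_price))
print(linear_growth_holds(2.0, 4.0, prices))  # condition holds with L = 2
print(linear_growth_holds(7.0, 4.0, prices))  # but not with L = 7
```

In this example the revenue falls off linearly on both sides of the kink (with slopes 2 and 6 near the optimum), which is the local behavior the theorem requires in place of a bounded second derivative.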



9 Conclusion and Future Work 

In this paper, we present a dynamic pricing algorithm for a set of single-product revenue
management problems. Our algorithm achieves an asymptotic regret arbitrarily close to $O(n^{-1/2})$
even if we have no prior knowledge of the demand function except some regularity conditions.
By complementing it with a worst-case bound, we show that our algorithm is almost the best
possible in this setting, and it closes the performance gaps between parametric and non-parametric
learning and between a post-price mechanism and a customer-bidding mechanism.

In terms of the algorithm itself, the dynamic learning algorithm integrates learning and 
doing in a concurrent procedure and may be of independent interest to the revenue management 
practitioners. 

There are several open questions to explore, including how to extend this result to
high-dimensional problems (with high-dimensional control and/or inventory products). In
high-dimensional problems, the structure of demand functions may be even more complicated and
the extension is not straightforward. Other directions may include models with competition
among retailers and/or strategic behavior of the customers.
10 Appendix

10.1 Examples of demand functions 

1. For linear demand functions \{p) = a — hp with Q < a< a <a and < 5 < 6 < &, it is easy 
to see that all our assumptions hold with M = a — bp, K = max{&, 6^^, a + 2bp}, — b 
and mjj = b. 

2. For exponential demand functions A(p) = ae^''^ with < a < a < a and < b < b <b, 
we have 

• \X{p)\ <ae-^-P 

• X{p) is Lipschitz continuous with coefficient a ■ b, r{p) is Lipschitz continuous with 
coefficient a + abp, and 7(A) is Lipschitz continuous with coefficient 

• r(A) — log A + ^ logo is second-order diffcrentiable and < r-"(A) < 

3. For logit demand functions X{p) = j^^p^-rs^, with < a < a < a and < 5 < 6 < 6, we 
have 

• \Hp)\ < 1 

• X{p) is Lipschitz continuous with coefficient b, r{p) is Lipschitz continuous with co- 
efficient \ -\-b -p, and 7(A) = 5 (log — a) is Lipschitz continuous with coefficient 
4 

b 

• = t(log ^ - a) is second-order diffcrentiable and -7 - e~°~'' P < r"(A) < 
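
As a numerical sanity check on the second-order behavior discussed in the examples above, the sketch below evaluates $r''(\lambda)$ by central finite differences for the linear and exponential cases. It assumes the revenue functions $r(\lambda) = \lambda(a - \lambda)/b$ (from $\lambda(p) = a - bp$) and $r(\lambda) = \lambda(\log a - \log\lambda)/b$ (from $\lambda(p) = ae^{-bp}$); the parameter values and function names are illustrative assumptions.

```python
import math


def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def linear_revenue(lam, a, b):
    """r(lam) = lam * gamma(lam) with inverse demand gamma(lam) = (a - lam) / b."""
    return lam * (a - lam) / b


def exponential_revenue(lam, a, b):
    """r(lam) = lam * gamma(lam) with gamma(lam) = (log a - log lam) / b."""
    return lam * (math.log(a) - math.log(lam)) / b


a, b, lam = 10.0, 2.0, 3.0  # illustrative values only
lin_dd = second_derivative(lambda x: linear_revenue(x, a, b), lam)
exp_dd = second_derivative(lambda x: exponential_revenue(x, a, b), lam)

# Closed forms for comparison: -2/b (linear) and -1/(b*lam) (exponential).
print(lin_dd, -2.0 / b)
print(exp_dd, -1.0 / (b * lam))
```

Both second derivatives are strictly negative and bounded on any demand-rate interval bounded away from zero, which is the kind of behavior the bounds on $r''(\lambda)$ in Assumption A describe.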

10.2 A Lemma on the Deviation of Poisson Random Variables 

In our proof, we will frequently use the following lemma on the tail behavior of Poisson random 
variables: 

Lemma 14. Suppose that $\mu \in [0, M]$ and $r_n \ge n^{\beta}$ with $\beta > 0$. If $\epsilon_n = 2\eta^{1/2} M^{1/2} (\log n)^{1/2} r_n^{-1/2}$,
then for all $n \ge 1$,

$P(N(\mu r_n) - \mu r_n > r_n \epsilon_n) \le \frac{C}{n^{\eta}}$ (59)

and

$P(N(\mu r_n) - \mu r_n < -r_n \epsilon_n) \le \frac{C}{n^{\eta}}$ (60)

for some suitably chosen constant $C > 0$.

We refer to Lemma 2 in the online companion of [8] for the proof of this lemma.
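
The upper-tail bound can be checked numerically for specific parameters. The sketch below computes the exact Poisson tail $P(N(\mu r_n) - \mu r_n > r_n\epsilon_n)$, reading the deviation as $r_n\epsilon_n = 2\sqrt{\eta M r_n \log n}$ and the right-hand side as $C/n^{\eta}$. The choices $\mu = M = 1$, $r_n = n$, $\eta = 1$, and the comparison against $C = 1$ are assumptions made for illustration, as are the function names.

```python
import math


def poisson_log_pmf(k, mean):
    """Log of P(N = k) for N ~ Poisson(mean), mean > 0."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def poisson_upper_tail(mean, threshold):
    """P(N > threshold) for N ~ Poisson(mean), summed term by term."""
    k = math.floor(threshold) + 1
    total = 0.0
    while True:
        term = math.exp(poisson_log_pmf(k, mean))
        total += term
        # Past the mode the terms decrease; stop once they are negligible.
        if k > mean and term < 1e-18 * max(total, 1e-300):
            return total
        k += 1


def deviation_threshold(n, eta, M, r_n):
    """r_n * eps_n with eps_n = 2 * sqrt(eta * M * log(n) / r_n)."""
    return 2.0 * math.sqrt(eta * M * r_n * math.log(n))


for n in (10, 100, 1000):
    mean = float(n)  # mu * r_n with mu = 1 and r_n = n
    tail = poisson_upper_tail(mean, mean + deviation_threshold(n, 1.0, 1.0, mean))
    print(n, tail, 1.0 / n)
```

For these parameter choices the exact tail lies below $n^{-\eta}$ at each tested $n$; the lemma itself only asserts the bound up to a constant $C$.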
10.3 Proof of Lemma 1



Proof: By plugging (29) and (31) into (24), we have

iV" = nim{/ : „25-i+(i-5).(f )' ^ (logn)^}. (61) 

Therefore, for any fixed S < ^, we can take 

TVl = Llogf + 1, (62) 



where $\lfloor x \rfloor$ is the largest integer less than $x$. Similarly, by plugging (32) and (34) into (28), we
have

iV^ = min{/ : „2*--i+(i-«)-(f )' < (logn)^}. 
Therefore, for any fixed S < |, we can take 

n! = [log. i^j + 1. 

Thus $N_\delta = \max\{N_\delta^u, N_\delta^c\}$ will be an upper bound for the number of iterations of our algorithm.
Furthermore, $N_\delta$ doesn't depend on $n$. □

10.4 Proof of Lemma 2

Part 1: We first prove the first part, that is, when the algorithm runs within Step 2 until $i$
reaches $N^u$.

We prove by induction on $i$. By the induction assumption, we know that at iteration $i$, with
probability $1 - O(1/n)$, $p^D \in I_i^u$. Now consider the next iteration. Define



vL = log n ■ max ■ 



riT" 



First, we establish a bound on the difference in revenue r{X{pf)) — r(A(p")), where is the 
revenue maximizing price on If and is the empirical revenue maximizing price defined in our 
algorithm. We assume pf is the nearest grid point to pf in this iteration. We consider three 
cases: 

• < p": This is impossible since we know that > > and by the induction 
assumption p^ G If. Therefore we must have G If, and by definition, achieves a 
larger revenue rate than p", which is contradictory to the definition of 

• pf — P^. In this case, by the granularity of the grid at iteration i, we have Ipfj' ^ p"| < 

and thus by our assumption that r"(A) is bounded, we know that |?'(A(p")) — 
r{X{Pi j*))\ < ttilK^ ■ C\^^ )^, therefore we have: 

r{X{p^)) - r{X{pf)) 

- r(A(p-)) - r(A(p^^.)) +P",A(p^^.) " ) " (pf Hpf) ^ P"^ m)) + Plr HPlr) ~ Pf Hp") 

< ruLK^'^f + 2maxi<,<,y \pl^X{pl^) - pi;X{pl^)\. 

„ ' (63) 
In (63), A is the observed demand rate and the last inequality is due to the definition of 
and that is among one of the p^^ . 






By Lemma 14 in Appendix 10.2, we have



P{\KpI,) - \{pl,)\ > CV^-J^) < ^, (64) 



with some suitable constant C. Therefore, with probabiUty 1 — O(^), r(A(p")) — r(A(p")) < 
Cu^. However, by our assumption that r"(A) < mj/ and that 7(A) is Lipschitz continuous, 
with probabihty 1 — O (^), p"| < C^/v^ (here "C" represents some generic constant, 
and the relations are not always specified). 

Now we consider the distance between and (this part of result can also be found in 
Lemma 4 in the online companion of |8]). Assume Pij* is the nearest grid point to pf. 
Then, using that we assumed T = 1, we have: 

\xm~^\ < \m)-x\ + \xm-m)\ 

< \X{pl^,)~x\ + \Xip'^)-X{p-r)\ 

< \X{pl^,)-x\ + \X{pl^,)-X{pl^,)\ + \X{p^,)-Xm ^ ^ 

< |A(p?) ~x\ + IXipl) - X{pl^,)\ + 2maxi<,<.e \X{pl^) - A(p«^.)l- 

And by the definition of p^, A(pJ^) — x and A(p^) — x must have the same sign, otherwise 
there exists a point in between that achieves a smaller value of \X{p) — a;|. Therefore we 
have 

|A(pD - Xipdl < |A(K) - Xipl^,)\ + 2 max^ \X{p^^^) - A(p^^.)l- (66) 
By the Lipschitz continuity of $\lambda$, we have

\X{p'r)-X{pl^.)\<K^^^. (67) 



Also by Lemma 14 in Appendix 10.2 we have with probability 1 — O (^), 



maxJA(K,)-A(pr,)l <Cyi^.^^. (68) 

Therefore, with probability 1 — O (;^), we have 

\Xip'i)~Xm<cVK (69) 
and by the Lipschitz continuity of i^(A), this implies that with probability 1 — O (^), 

m-p^\<cVK. (70) 

Therefore, we have 

P{\p.,-p^\>Cy/K} 

< pm-p^\>c^j+p{\p:-p'^\>c^,}<o 



u 



(71) 



Here we used the fact that:

$|\max\{a, c\} - \max\{b, d\}| > u \Rightarrow |a - b| > u$ or $|c - d| > u$.
Note that (71) is equivalent to saying that



Pip"" e[p,-C^,,P, + C^])>l-o(^^y (72) 

Now also note that the interval li^i in our algorithm is chosen to be 

lognR" -g^ „ 21ognP"-g; ^ 



which is of order yTogn greater than -y/ ujj (and according to the way we defined and 
r", the two terms in are of the same order). Therefore we know that with probability 

i-0(i),p^e/r+i. 

• < p": In this case, — p'^ . With the same argument, but only the p'^ part, we know 
that with probabihty 1 - O (^), G 

Also, as claimed in the previous lemma, the number of steps A^" doesn't depend on n when 
(5 < i is fixed. Therefore, we can take a union bound over steps, and claim that with 
probability 1 - O (i), G I^, for aU i = 1, A^". 

Part 2: Using the same argument as in Part 1, we know that with probability 1 — O (-), 



pD g ju j _ l,...,io. Now at io, condition (16 1 is triggered. We first claim that whenever 



(161 is satisfied, with probability 1 — O (^), p^ = p'^ 

By the argument in Part 1, we know that with probability 1 — O (^), 



\pI-pU<V^-^^^ 



\pl-pl\<./Iogn- 

Therefore, if 



p — p 

>pro+2yi^-^^hr^ (73) 

No 



holds, then with probability 1 — {^^')^ 

Pl>Pl 



And when (73) holds, pi is not the left end-point of I^^^ and p"^ is not the right end-point of 7^" , 
which means 

Now we consider the procedure in Step 3 of our algorithm and show that with probability 
1 - O (i), p^ = p= e If for all i = 1, 2, A^ 

We again prove by induction. By the induction assumption, we can assume that with 
probability 1 — O(^), p^ = p'^ G If. Now we consider (which is the optimal empirical 
solution in Step 3 in our algorithm). Define in this case 



ig n ■ max 



P^-P^ 




Using the same discussion as in Part 1, we have with probability 1 — O (^) that 

\q^-p^\^m-p'\<cvl. 
However, remember that in our algorithm, the next interval is defined to be
jc logn p1 - ^ . logn pI-P':, 



which is of order -y/Iogn larger than w^. Therefore with probability 1 — 0(^),p^ — p'^ e ^i+i- 
Again, taking a union bound over these Ns steps results in this lemma. □ 

10.5 Proof of Lemma 4

Proof. Define Af^ = {uj : Y^j - EY^ < EY^}. First we show that 

f]A-cAl (74) 

hi 

To show this we only need to show that "^J^ij-^^ij — R-ecall that |A(p)| < M, for all 
p e [p, p] . We have: 

EY^ <mY^ nA, = Mn - MnNsr^^ . (75) 

ij i,j i=l 

By our definition of r", we know that for every fixed S, r^u is of order less than 1 (otherwise the 
algorithm is stopped at the previous iteration). Therefore, when n is large enough, 2 ^ EY^j < 
nx, i.e., (74) holds unifo rml y in n. 

However, by Lemma 14 we know that P{Afj) > 1 — O (^) and thus 

P{Ai) >l-0 



since each k" < n and N is independent on n. 



Now we can rewrite the left-hand side of (37) as: 

N" ft" 



^5^£;[p^^.y,^j(^5')/(A^)] 

i=i j=i 

^ ^ £;[p^, j(A5')/(^2)^^K"^(^r^2)]] 
i=i j=i 

5]X^ii;b^/(^5')/(A^)(ii;K^] - E[Y-i{{Arru{A-r)])]. 
i=l J=l 



(76) 



However, by the Cauchy-Schwarz inequality and the properties of the Poisson distribution
($E N(\lambda) = \lambda$, $E[N(\lambda)^2] = \lambda^2 + \lambda$),

E\Y-I{{Air U (A«)^)] < ^[E{Yi;)Y ■ U {Am < O (^-)=) E[Y-]. (77) 
Now we plug this back into (76) and obtain



N" ft" 
i=l 3 = 1 



\ v""/ i=l j=l 

= (l-O (^) ) f] £;[pr,,A(pr,,)^Ar/(A«)/(A^)] 
^"^ i=l j=l 

/ 1 \ ^" 

= (l-O ^ )^^i?[pr.,A(pr.,)nAri/(A^)/(^^)]. 
^ i=i 1=1 



Now we consider 

^ ii;[pr,,A(pr,,)^Ar|/(v45')/(A^)]. 

By the bound on the second derivative of the function r(A) and the assumption that 
PhMplj) = r{X{pl^)) > r{X{p^)) - mUXiplj) - Kp"")? > p'^Kp'') - m^K^p^ ~ fj' 
Therefore, we have 

(1 - O ( ^ )) • EE^K.^bL)"A^|/(A5')/(A^)] 



TV" 



> (1-0 j ) • (5]p^A(p^)nrr - m^i^^ j](pr " tfnr^) 



1=1 

TV 



> (1-0 ( -i. )) • (^p^A(p^)nrf - miif2iV"nrr) 



JV" 



> 5]p^A(p^)nrr-OnTr, 



where the second to last step is due to (22) and (23), and the last step is because ^ 
^i/2-25-(i-5)(3/5)'-i(-iQg^)5 foj. j ^ i^...^iV". Therefore, the lemma holds. □ 

10.6 Proof of Lemmas 5 and 6

Proof of Lemma 5. First we show that with probability $1 - O(1/n)$,
Then by the Cauchy-Schwarz inequality,



< O ( {EY- + E.., EY,]) < Cnrr (83) 



where the second to last inequality is because that for a Poisson random variable N{^), Var{N{}i)) = 
fi^ and the last inequality is because the demand rate is bounded and r" > n~^^^. Therefore, 
we have 



EiY"" + Yir - EY"" - EY^'^)+ < Cnr^ 
which implies our lemma. 



(84) 



To show (82), we apply Lemma 14. For each given $i, j$, by Lemma 14 we have
P(y,^ - EY;^ > 2M^nAf\ogn) < ^. 
By taking a union bound over all i, j, we get 

PiE - E EY,] > 2M ^nAflogn) 
< E^i E;Ii PiY^j - EYll > 2M^nAf logn) < O(-), 



where the last step is because < n and iV^ > A'"" is a constant with respect to n. 
On the other hand, by the definition of r" and nf, we have 



(85) 



(86) 



where the last inequality follows from the definition of and r" in ( 29 1 and ( 30 1 . We then 
consider — again use inequality (59 1. We have 

P(r" - ElY"] > 2My/n\ogn) < 

since \/ n log n < nr" when S < ^, the lemma holds. □ 

Proof of Lemma 6. By definition, we have

i?r,^=nA(pr,,)Ar, 



where 



pr + (j-i) 



logn Pt^i T\P^ ~Pz 



And as we showed in 



And by our above discussion, with probability 1 — condition (161 doesn't hold, i.e., 

Pi-i > P^-i - 2Vlogn ^ 



70),pU>pU~VW^ 



Pi-i-Pi 



Also since > P^, we must have pl^i > ]f ■ Therefore wenave 



> 



^ + (J-1) 
> P^ + (,7-l) 



logn 



logn Pi-i 



- Sv^logn • 



Pi^i -P. 



(87) 
when $\sqrt{\log n} > 6$. Using the Taylor expansion for $\lambda(p)$, we have that
X{pl,) < A(p=) + ((j-l)^ 



Pi — P" logn Pt-i ^, ,,„, „ ,,,2 



—u \ 2 



(88) 



The last inequality uses the fact that $\lambda''(p)$ is bounded by a constant, which is not hard to derive
from Assumption A. Therefore



AT" 



(89) 



where the last equation follows from (22) and (23).
Also we have 

EY-^XiPHl-tl^), 

and with probability 1 — O (i), 

P = PN^ + 2\/log n ■ ^"]^jv^"" 

> max(p]^„ , ) + 2 Vlog n ■ 

> max(pXr„,p^„) 

> p" > P", 



(90) 



where the first equation is due to the definition of p, the second one is due to (71), and the last
one is by Lemma 2. Therefore,

EY" < X{p')n{l - tl^) 



and thus 



□ 



E%j + £;y" <nx + Cm^. 



10.7 Proof of Lemmas 7, 8, 9 and 10

Proof of Lemma 7. Define $A_1 = \{\omega : p^D \in I_i^u \text{ for all } i\}$. By Lemma 2 we know that
$P(A_1) \ge 1 - O(1/n)$, and we have



TV" tj' AT" k" 

i— 1 J — 1 i—\ j — 1 



However, note that U^^-^^B/ — (U^^^Bj)^ only depends on the realization up to period i — 1. 



ri-l 
Therefore, we know that Y^" given p"^ is independent of U^j^^B;. Therefore, we have



AT" 



i=l j=l 



i=l J=l 



j=l 

AT" 

1=1 



where the second to last inequality is because of the bounded second derivative of the revenue
function and that $A_1$ holds; and the last inequality is because of relation (23) and that
$N^u$ is bounded by a constant which is independent of $n$. □

Proof of Lemma 8. Define

A2 = {lo: For each i, j, Y,"^ < EY^"^ + 2M^nA^ logn and f" < EY"" + 2My/n\ogn}. (93) 

By Lemma 14, $P(A_2) = 1 - O(1/n)$. Similar to (38), we have the following relation:

i?[pmin(f«,(nx-^r«)+)/(^2)/(S^.+i)] 

> E[pY^I(A2)I{Bm^+i)] - E[p{Y'^ + Y.^i; ~ nx)+/(A2)/(B^.+i)]. 



(94) 



And by the same argument as (41), we have



E[pY^I{A2)I{Bn.+,)] > (1-0 (^^^ )(p^A(p^) • 71(1 - iV)^(5iV"+i)) - Cnr^) 

> p^A(p^) • n(l - iV)^(SjV"+i)) - Cnri". 



Now we consider 



We first relax it to 



E[piY^ +^Y,^ - nx)+I{A2)I{Br,.+^)]. 
pE[{Y^+Y.^i; - nx)+I{A2)I{B^.+^)]. 



(95) 
(96) 

(97) 



Conditional on A, Y^^ < EY,^+ 2My^nAf logn for all i,j and F" < EY,"^ + 2M^n\ogn, 
and by the argument in Lemma 5 we have ■ 2M y/riNf\ogn + 2M ^/ nlogn < Cnr^. Also, 
by the same argument as in Lemma pi we have J2i j ^^ij + — nx < Cnr^. Therefore, the 



lemma holds. □ 






Proof of Lemma joj Define — {uj : G I^,yi}. By Lemma |2| we know that ^(^3) = 
1 — O (^). Also note that each Y^'^j given is independent with UjCiBi- Therefore we have, 



1=1 j=l 

= 1111 E[nA'^pl^X{pl^)I{ul\Bi)I{A,)] 



i=i j=i 



^ ^ i=l i=l 

> nT^p^X{p^)P{ul\Bi) - Cnrl, 



(98) 



where the equality is because of the independence of $Y_{ij}^c$ and $B_l$ as we argued above, the second
inequality is because $p^D \in I_i^c$ for all $i$ and the Lipschitz continuity of the revenue rate function.
The last inequality is because of relation (27) and that $N^c$ is bounded by a constant
that does not depend on $n$. □

Proof of Lemma 10. Define

Ai = {u: < EY^ + 2M^nAf logn and F.^ < EY.^^^^ + 2M^nA^logn, Vi, j; 

< + 2MVnlogn and p° e /f , Vi}- 



By Lemma 14, we have that $P(A_4) = 1 - O(1/n)$.
By the same transformation as in (94), we have

min(f,^ {nx-Y^Y.^i;-Y^Y. 

1=1 j=i 1=1 j=i 

> E[qYfI{Bi)I{A,)] - + E E + - ria:)+/(i30/(^4)]. 



For the first term, we have 
Em^IiBOHA^)] 

> {i-o(^^^)EmYiViBmA,)) 

> (1-0 (-) )E{n{l -tr- t%.mq)I{Bi)) 



(100) 



> (1-0 (^) )(r^(l - ir - t^Are)p^A(p^)P(BO - n(l -tf- t%.){p%.+, - f^^^^)) 

> n{\-t^~t%.)p°X{p°)P{Bi)-CnTi, 



(101) 
where the third inequality is because of the definition of q and the Lipschitz continuity of the
revenue rate function. And the last inequality is due to condition (28).
For the second term, we first relax it to 

pE[{Y^ Y^^ + E E + ^i' nx)+m)I{A,)], 

i,j 1=1 j = l 

And by the definition of A4, we have, 

^[(E + E E + - nx)+m)i{A,)] 

I 

< (E EY,'^, + E E ^^^^ + ^^i' - "^)^ + ^""^i ^ ^"^f- 

ij i=l j=l 

(102) 

The first inequality is because as we arg ued in Lemma|5| J2i,j 2M ^/nA^^og^+Y.^.J 2M ^nA^= log 
M-v^n log n < CnT", and the second inequality is because 

E^^^5+EE^^<" + ^^^ 

ij i=l j = l 



n+ 



1=1 
I 



N" 



i=l 



i=l 



< nx + Ctit", 



(103) 



where the first inequality is by definition, the first part of the second inequality is by Lemma [6j 
the second and third part of the second inequality is due to the continuity of the demand rate 
function and that G If, so that p'i ^ — p^ < M{p1 — p'r) and the relation in (27|. And the 
last inequality is because when p"^ > p", A(p^) is x (remember that we assumed T = 1). 
Combining (101) and (103) proves that the lemma holds. □



10.8 Proof of Lemma 12



Proof. Consider the final term in (54). Note that we have the following simple inequality:

^2 



log (x + 1) > X - 



Therefore, we have 



2(1 -N)^ 

{X - 1)2 

2(2 -x) 



Vx < 1. 



(x - 1)2 

loKx-l<-a;+^— '-- V0<a;<2. 



Apply this relationship to the final term in (54) and note that for any $z \in [1/3, 2/3]$ and
$p(s) \in [1/2, 3/2]$,



2 . 5 + z - zp{s) 



< 



3 I + - zqp{s) 



< 2, 



(104) 






we have 

" "Jo 2(2- i0±g^|Pi£)) (l/2 + zo-^op(s))2 - '"J„ il/2 + zo-zop{s)y 

Also, for $z \in [1/3, 2/3]$ and $p(s) \in [1/2, 3/2]$, we have

l + zo-zapis)>^. (106) 

Therefore, we have 

fCiVl,V:) < 16n(zo - zfEl [ {l~p{s)fds. 

Jo 

However, under the case when the parameter is zq, we have — 1 and 

Rl{x,T;zo) = i_ 



J^{x,T;zq) 
^ ElJ^{r{p^)~r{p{s)))ds 
ElJ^r{pD)ds 

> \Elj {l-p{s)fds, 



(107) 



where the first inequaUty follows from the definition of and that we relaxed the inventory 
constraint, and the second inequality is because of the 4th condition in Lemma [TT] Therefore, 

nVl.V:) < 24n(zo - zfRl{x,T-zo), (108) 

and Lemma 12 holds. □

10.9 Proof of Lemma 13
Proof. 

We define two intervals C^q and Cz^ as follows: 

= - 48^'^"(^°) + 48^] = - 48^'^"^^") + 48^]- 



Note that by the 5th property in Lemma 11 we know that C^a and C^n are disjoint. 
By the 4th property in Lemma 11, we have for any $z$,



r{p^{z)- z) - r{p; z) > ^{p - p^{z))\ (109) 



Also by the definition of the regret function, we have 

El loMp'^izoy, zq) - r{p{sy,zo)}ds 



Rl{x,T-zo) > 



ElJor{p^{zo);zo)ds 



> 



4 



3(48)2 Vr^Vo 



(110) 
where the first inequality is because we relaxed the inventory constraint when using $\pi$, and the
second inequality is because of (109), the definition of $C_{z_0}$ and that the denominator is 1/2.

In the last equality, Vzq^^ is the probability measure under policy tt and up to time s (with 
underlying demand function has parameter zq). Similarly, we have 

E:jl{T{p^{z-,)-r4)~r{p{s)-z-,))ds 



K{x,T;z^) > 



E^.J^r{pD(z^^);z^)d.s 



> 



(111) 



where in the second inequality, we use the fact that the denominator is less than 1, and in the
last equality, Vz}'^^ is defined similarly as the one in (110). 

Now consider any decision rule that maps historical demand observations up to time s into 
one of the following two sets: 

Hi : p{s) e 

By Theorem 2.2 in [25], we have the following bound on the probability of error of any decision
rule:

v:yHpi^) e Czi^} + v:,{'\p{s) i C,.} > 1 (112) 



However, by the definition of the K-L divergence (54), we know that 

r-l 



^{n, , ) - /c(Kj^^ , vi^^ ) > Ei^ 



1 {z^-z)\\~p[s)f 
2(2 - ^^/^ + ^" " 



ds > 0, 



where the last inequality is because of (106). Therefore, we have 

1 -A 



(113) 
(114) 



Now we add (110) and (111) together. We have 
Rl{x,T-z^) + Rl{x,T-z^) > 

1 

- 3(48)V^^ 

Thus, Lemma 13 holds. □

10.10 Proof of Theorem 3

We first define Algorithm DPA2. 



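Algorithm DPA2:

Step 1. Initialization

(a) Consider a sequence of $\tau_i, \kappa_i$, $i = 1, 2, \dots, N$ ($N$ will be defined later). Define $\underline{p}_1 = \underline{p}$ and
$\bar{p}_1 = \bar{p}$. Define $t_i = \sum_{j=1}^{i} \tau_j$, for $i = 0$ to $N$;

Step 2. Dynamic Learning

For $i = 1$ to $N$ do

(a) Divide $[\underline{p}_i, \bar{p}_i]$ into $\kappa_i$ equally spaced intervals and let $\{p_{i,j}, j = 1, 2, \dots, \kappa_i\}$ be the left
endpoints of these intervals

(b) Divide the time interval $[t_{i-1}, t_i]$ into $\kappa_i$ parts and define

$\Delta_i = \frac{\tau_i}{\kappa_i}, \quad t_{i,j} = t_{i-1} + j\Delta_i, \quad j = 0, 1, \dots, \kappa_i$

(c) Apply $p_{i,j}$ from time $t_{i,j-1}$ to $t_{i,j}$, as long as the inventory is still positive. If no more
units are in stock, apply $p_\infty$ until time $T$ and STOP

(d) Compute

$\hat{d}(p_{i,j}) = \frac{\text{total demand over } [t_{i,j-1}, t_{i,j}]}{\Delta_i}, \quad j = 1, \dots, \kappa_i$

(e) Compute

$\hat{p}_i^u = \arg\max_{1 \le j \le \kappa_i} \{p_{i,j}\,\hat{d}(p_{i,j})\}$

and

$\hat{p}_i^c = \arg\min_{1 \le j \le \kappa_i} |\hat{d}(p_{i,j}) - x/T|$

(f) Set $\hat{p}_i = \max\{\hat{p}_i^c, \hat{p}_i^u\}$. Define

$\underline{p}_{i+1} = \hat{p}_i - \frac{\log n}{2} \cdot \frac{\bar{p}_i - \underline{p}_i}{\kappa_i}$

and

$\bar{p}_{i+1} = \hat{p}_i + \frac{\log n}{2} \cdot \frac{\bar{p}_i - \underline{p}_i}{\kappa_i}$

And the price range for the next iteration

$I_{i+1} = [\underline{p}_{i+1}, \bar{p}_{i+1}]$

Here we truncate the interval if it doesn't lie inside the feasible set $[\underline{p}, \bar{p}]$;

Step 3. Applying the optimal price

(a) Use $\hat{p}_N$ for the rest of the time until the stock runs out.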


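As in our study of Algorithm DPA, we first list the set of equations we want the parameters
to satisfy:

$\frac{\bar{p}_i - \underline{p}_i}{\kappa_i} \sim \sqrt{\frac{\kappa_i}{n\tau_i}}, \quad \forall i = 1, \dots, N$ (115)

$\bar{p}_{i+1} - \underline{p}_{i+1} \sim \log n \cdot \frac{\bar{p}_i - \underline{p}_i}{\kappa_i}, \quad \forall i = 1, \dots, N-1$ (116)

$\tau_{i+1} \cdot \frac{\bar{p}_i - \underline{p}_i}{\kappa_i} \cdot \sqrt{\log n} \sim \tau_1, \quad \forall i = 1, \dots, N-1.$ (117)

Also we define

$N = \min\{l \mid \frac{\sqrt{\log n} \cdot (\bar{p}_l - \underline{p}_l)}{\kappa_l} < \tau_1\}.$ (118)

The meaning of each of these equations is similar to the one in DPA (but since we have
a different local behavior of the revenue function under this alternative assumption, we have
different relationships between these parameters). We solve $\kappa_i$ and $\tau_i$ from the above relations.
Define $\tau_1 = n^{-\delta} \cdot (\log n)^3$. We get

$\kappa_i = n^{\frac{1}{3}(1-\delta)(\frac{2}{3})^{i-1}} \log n, \quad \forall i$ (119)

$\tau_i = n^{1-2\delta-(1-\delta)(\frac{2}{3})^{i-1}} (\log n)^3, \quad \forall i$ (120)

And as a by-product, we have

$\bar{p}_i - \underline{p}_i = n^{-(1-\delta)(1-(\frac{2}{3})^{i-1})},$ (121)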
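
The parameter schedule can be evaluated directly. The sketch below assumes the closed forms read $\kappa_i = n^{\frac{1}{3}(1-\delta)(2/3)^{i-1}}\log n$, $\tau_i = n^{1-2\delta-(1-\delta)(2/3)^{i-1}}(\log n)^3$ and $\bar{p}_i - \underline{p}_i = n^{-(1-\delta)(1-(2/3)^{i-1})}$, and then checks that they balance the two terms in (115) and reproduce the interval recursion in (116). The function names and the values of $n$ and $\delta$ are illustrative assumptions.

```python
import math


def dpa2_parameters(n, delta, i):
    """Return (kappa_i, tau_i, width_i) under the assumed closed forms."""
    log_n = math.log(n)
    shrink = (2.0 / 3.0) ** (i - 1)
    kappa = n ** ((1.0 - delta) * shrink / 3.0) * log_n
    tau = n ** (1.0 - 2.0 * delta - (1.0 - delta) * shrink) * log_n ** 3
    width = n ** (-(1.0 - delta) * (1.0 - shrink))
    return kappa, tau, width


def relation_ratios(n, delta, i):
    """Ratios that equal 1 when relations (115) and (116) hold exactly."""
    kappa, tau, width = dpa2_parameters(n, delta, i)
    _, _, next_width = dpa2_parameters(n, delta, i + 1)
    grid_vs_noise = (width / kappa) / math.sqrt(kappa / (n * tau))
    recursion = next_width / (math.log(n) * width / kappa)
    return grid_vs_noise, recursion


for i in range(1, 6):
    print(i, dpa2_parameters(10**6, 0.25, i), relation_ratios(10**6, 0.25, i))
```

Under these forms the grid granularity $(\bar{p}_i - \underline{p}_i)/\kappa_i$ and the statistical error $\sqrt{\kappa_i/(n\tau_i)}$ are of exactly the same size at every iteration, which is the balance that (115) expresses.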
Now we prove some lemmas similar to those we used to prove Theorem 1.



Lemma 15. Fix $\delta < 1/2$. Then $N$ defined in (118) exists. Moreover, $N$ is independent of $n$.



Proof. We plug (119) and (121) into (118). We get A^ = log? which does not depend on 

n. □ 

Now we prove a lemma similar to Lemma 2, showing that the price range for each learning
period contains the actual optimal price, with high probability.

Lemma 16. Denote the optimal deterministic price by $p^D$. Let the assumption that $L|p^u - p| \le
|r(\lambda(p^u)) - r(\lambda(p))|$ hold. Then with probability $1 - O(1/n)$, $p^D \in I_i$ for any $i = 1, \dots, N$.

Proof. Like the proof of Lemma 2, we prove by induction. Assume that with probability
$1 - O(1/n)$, $p^D \in I_i$. Now consider the $(i+1)$th interval. Define

$u_i = \sqrt{\log n} \cdot \max\left\{\frac{\bar{p}_i - \underline{p}_i}{\kappa_i}, \sqrt{\frac{\kappa_i}{n\tau_i}}\right\}.$ (122)

We consider three cases: 

* Pi < p^'. This is impossible since we know that p^ > P^ > P^ and by the induction 
assumption e h. Therefore, we must have S li and by definition, p" achieves larger 
revenue rate than p", which is contradictory to the definition of p" 

• pf — p". We assume pi,j* is the nearest grid point to p" in this iteration. In this case, by the 
granularity of the grid at iteration i, we have |pi,j» — p" | < ^'^ ~' and thus by our assumption 

that r(p) is Lipschitz continuous, we know that |r(A(p")) — r{X{pij»))\ < C ■ ( ^\ ~' ) , 
therefore we have: 

r(A(p")) - r(A(p-)) 

= r(A_(p")) - r(A(p,.,.)) +p^,A(p,,,.) - p^^^..A(p,j.) - (p^A(p^) - p^ A(p«)) + p,,,. A(p,,,. ) " P^A(pr) 

< (^(^^ir^) +2niaxi<j<„^ Ipj.jA(pij) -pijA(pij)|. 

/ (123) 

In (123), A is the observed demand rate, and the last inequality is due to the definition of 
p" and that p" is among one of the p^j-. 

By Lemma 14 in Appendix 10.2, we have

P(|A(p,,,) - A(p,,,)| > Cyiog^ • . /^) < \ (124) 
with some suitable constant $C$. Therefore, with probability $1 - O(1/n)$, $r(\lambda(p^u)) - r(\lambda(\hat{p}_i^u)) \le
Cu_i$. However, by our assumption that $L|p^u - p| \le |r(\lambda(p^u)) - r(\lambda(p))|$, with probability
$1 - O(1/n)$, $|p^u - \hat{p}_i^u| \le Cu_i$.

Now we consider the distance between pi and p^. Assume Pi,j* is the nearest grid point to 
pI- Then we have 

\Kp'i)~A < \m)-^\ + fm)-m)\ 

< \x{p,,,.)-x\ + \x{p'i)-xm 

< iA(p..,.) -x\ + |A(p,,,.) - A(p.,,oi + \m) - m)\ 

< \X{Pi) - x\ + \X{p'1) - X{pi^j.)\ + 2maxi<j<K, \MPi,j) - HPi,])\- 

(125) 

And by the definition of p^, A(p^) — a; and A(p^) — x must have the same sign, otherwise 
there exists a point in between that achieves smaller \X{p) — x\. Therefore we have 

|A(pD - Xip'r)\ < \X{p1) ~ A(p,,,-)l + 2 max |A(p.,,) - Afe,,)|. (126) 
By the Lipschitz continuity of $\lambda$, we have

|a(pD-a(p.,,.)I<^V^- (127) 



Also by Lemma 14 in Appendix 10.2 we have with probability 1 — O (- 



max \X{p,.j)-X{p,,,)\<C^/k^- J — . (128) 
i<j<iii y jiTj 

Therefore, with probability 1 — O (;^), we have 

\X{pt)-X{p'r)\<Cul (129) 
and by the Lipschitz continuity of j^(A), this implies that with probability 1 — O (^)j 

\pt-p1\<Cn^. (130) 

Therefore, we have 

< p{m - P-\ > c<} + PM -p1> ck} <o(^ 



(131) 



Here we used the fact that:

$|\max\{a, c\} - \max\{b, d\}| > u \Rightarrow |a - b| > u$ or $|c - d| > u$.



Note that (131) is equivalent to saying that

P(p^ e [p, - Cul^,p, + C<]) > 1 - o (^i^ . (132) 

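Now also note that the interval $I_{i+1}$ in our algorithm is chosen to be

$\left[\hat{p}_i - \frac{\log n}{2} \cdot \frac{\bar{p}_i - \underline{p}_i}{\kappa_i},\ \hat{p}_i + \frac{\log n}{2} \cdot \frac{\bar{p}_i - \underline{p}_i}{\kappa_i}\right],$

which is of order $\sqrt{\log n}$ greater than $u_i$ (and according to the way we defined $\kappa_i$ and
$\tau_i$, the two terms in $u_i$ are of the same order). Therefore we know that with probability
$1 - O(1/n)$, $p^D \in I_{i+1}$.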
• < p": In this case, — jf . With the same argument, but only the p"^ part, we know
that with probabihty 1 — 0(^),p^ S J^+i 

Also, as claimed in the previous lemma, the number of steps $N$ doesn't depend on $n$ when
$\delta < 1/2$ is fixed. Therefore, we can take a union bound over the $N$ steps, and claim that with
probability $1 - O(1/n)$, $p^D \in I_i$ for all $i = 1, \dots, N$. □
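
The elementary fact about maxima used in the proof above is equivalent to the bound $|\max\{a, c\} - \max\{b, d\}| \le \max\{|a - b|, |c - d|\}$. The following sketch checks that bound exhaustively on a small grid of values; the grid and the function names are arbitrary choices made for illustration.

```python
import itertools


def max_gap(a, b, c, d):
    """|max{a, c} - max{b, d}|, the left-hand side of the inequality."""
    return abs(max(a, c) - max(b, d))


def check_max_inequality(values, tol=1e-12):
    """Check max_gap <= max(|a - b|, |c - d|) for all 4-tuples from values."""
    for a, b, c, d in itertools.product(values, repeat=4):
        if max_gap(a, b, c, d) > max(abs(a - b), abs(c - d)) + tol:
            return False
    return True


grid_values = [-2.0, -0.5, 0.0, 0.3, 1.0, 2.5]
print(check_max_inequality(grid_values))
```

In the proof, $a$ and $c$ play the role of the two estimated prices and $b$ and $d$ the role of their targets, so a large error in the maximum forces a large error in at least one of the two estimates.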

Now we have proved that with high probability, $p^D$ will always be in our interval. Next we
will analyze the revenue collected by this algorithm.

Define $Y_{ij}$ to be the Poisson random variable with parameter $\lambda(p_{i,j})n\Delta_i$ ($Y_{ij} = N(\lambda(p_{i,j})n\Delta_i)$).
Also define $\hat{Y}$ to be a Poisson random variable with parameter $\lambda(\hat{p}_N)n(1 - t_N)$ ($\hat{Y} = N(\lambda(\hat{p}_N)n(1 - t_N))$).
We define the following event:

$A_1 = \{\omega : \sum_{i,j} Y_{ij} < nx\}.$



We have 



N 



J:{x,T;\) > +i?[pmin(f,(nx-^>-,)+)/(^i)]. (133) 

i=i j=i ij 



In the following, we will consider each term in (133). We will show that the revenue collected
in both parts is "close" to the revenue generated by the optimal deterministic price $p^D$ in its
corresponding part (and the consumed inventory is also near-optimal). We first have:

Lemma 17.

$E\left[\sum_{i=1}^{N}\sum_{j=1}^{\kappa_i} p_{i,j} Y_{ij} I(A_1)\right] \ge \sum_{i=1}^{N} p^D\lambda(p^D)n\tau_i - Cn\tau_1$ (134)

Proof. The proof of this lemma is almost the same as the proof of Lemma 4 in Appendix 10.5,
except that in (80), $m_L K^2(\bar{p}_i - \underline{p}_i)^2$ is replaced by $L(\bar{p}_i - \underline{p}_i)$. In (81), we use the relationship
defined in (117). The result follows. □



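Now we look at the other term in (133). We have

$E[\hat{p}_N \min(\hat{Y}, (nx - \sum_{i,j} Y_{ij})^+)] = E[\hat{p}_N(\hat{Y} - \max(\hat{Y} - (nx - \sum_{i,j} Y_{ij})^+, 0))]$

$\ge E[\hat{p}_N \hat{Y}] - E[\hat{p}_N(\hat{Y} + \sum_{i,j} Y_{ij} - nx)^+].$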


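For $E[\hat{p}_N \hat{Y}] = E[\hat{p}_N \lambda(\hat{p}_N)n(1 - t_N)]$, we apply the same argument as in the proof of Lemma
17, and we have:

$E[r(\lambda(\hat{p}_N))] \ge r(\lambda(p^D)) - Cu_N,$ (135)

where $u_N = \sqrt{\log n} \cdot \max\left\{\frac{\bar{p}_N - \underline{p}_N}{\kappa_N}, \sqrt{\frac{\kappa_N}{n\tau_N}}\right\}$. Therefore,

$E[\hat{p}_N \hat{Y}] \ge p^D\lambda(p^D) \cdot n(1 - t_N) - Cnu_N \ge p^D\lambda(p^D) \cdot n(1 - t_N) - Cn\tau_1.$ (136)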
Now we consider 



E\pN{Y + J2Y^J~nx)+]. 



(137) 



h3 



First we relax this to 



We claim that: 



pE{Y + Yij - nx\ 
Lemma 18.

$E(\hat{Y} + \sum_{i,j} Y_{ij} - E\hat{Y} - \sum_{i,j} EY_{ij})^+ \le Cn\tau_1,$ (138)

where $C$ is a properly chosen constant.

and

Lemma 19.

$\sum_{i,j} EY_{ij} + E\hat{Y} - nx \le Cn\tau_1,$ (139)

where $C$ is a properly chosen constant.

The proof of Lemma 18 is exactly the same as the proof of Lemma 5. For Lemma 19, we have:

$EY_{ij} = \lambda(p_{i,j})n\Delta_i,$ (140)
$E\hat{Y} = \lambda(\hat{p}_N)n(1 - t_N).$ (141)



By our assumption that is in the interior and Lemma 16 we know that with probability 
l"0(^),p'^ <|3j for ah our price ranges. Therefore, with probabihty 1 — O (^), we have 

f-p^,,<P,-p. (142) 

By the Lipschitz condition on A(p), this imphes that X{pij) — Mp'^) ^ ^ Pi^' Therefore, 
with probabihty 1 — O (^), 

J:.^ J EY,, - nxtN = ^{EY,, - X{p^)nA,) 

< CnE.,,(p.-p,)A. (143) 
= CnTi. 

Similarly, we have that with probability 1 — O (;^), 

A(p^)-A(p^)< ^^'-^ <Cri. 

And therefore, 

EY - nx{l - tN) = EY - A(p'=)n(l - t„) < Cnri. (144) 

Thus the lemma holds. □ 

Combining Lemmas 17, 18 and 19, Theorem 3 holds.